MCP Servers

Scrapy MCP Server

A powerful web scraping MCP server built on Scrapy and FastMCP that supports multiple scraping methods (HTTP, Scrapy, browser automation), anti-detection techniques, form handling, and concurrent crawling. Designed for commercial environments with enterprise-grade features like intelligent retry mechanisms, performance monitoring, and configurable data extraction.

README

Scrapy MCP Server

一个基于 Scrapy 和 FastMCP 构建的强大、稳定的网页爬取 MCP Server，专为商业环境中的长期使用而设计。

系统要求: Python 3.12 或更高版本

🚀 特性

核心功能

多种爬取方法: 支持简单 HTTP 请求、Scrapy 框架和浏览器自动化
智能方法选择: 自动选择最适合的爬取方法
并发处理: 支持多个 URL 的并发爬取
配置化提取: 灵活的数据提取配置系统

高级功能

反反爬虫: 使用 undetected-chromedriver 和 Playwright 的隐身技术
表单处理: 自动填写和提交各种类型的表单
JavaScript 支持: 完整的浏览器渲染支持
智能重试: 指数退避重试机制
结果缓存: 内存缓存提升性能

企业级特性

错误处理: 完善的错误分类和处理
性能监控: 详细的请求指标和统计
速率限制: 防止服务器过载
代理支持: 支持 HTTP 代理配置
随机 UA: 防检测的用户代理轮换

📦 安装

# 确认 Python 版本 (需要 3.12+)
python --version

# 克隆仓库
git clone git@github.com:ThreeFish-AI/scrapy-mcp.git
cd scrapy-mcp

# 快速设置（推荐）
./scripts/setup.sh

# 或手动安装
# 使用 uv 安装依赖
uv sync

# 安装包括开发依赖
uv sync --extra dev

# 或者使用传统方式
pip install -e .

# 或者使用开发模式
pip install -e ".[dev]"

🔧 配置

创建 .env 文件来自定义配置：

# 服务器设置
SCRAPY_MCP_SERVER_NAME=scrapy-mcp-server
SCRAPY_MCP_SERVER_VERSION=0.1.0

# 并发和延迟设置
SCRAPY_CONCURRENT_REQUESTS=16
SCRAPY_DOWNLOAD_DELAY=1.0
SCRAPY_RANDOMIZE_DOWNLOAD_DELAY=true

# 浏览器设置
SCRAPY_MCP_ENABLE_JAVASCRIPT=false
SCRAPY_MCP_BROWSER_HEADLESS=true
SCRAPY_MCP_BROWSER_TIMEOUT=30

# 反检测设置
SCRAPY_MCP_USE_RANDOM_USER_AGENT=true
SCRAPY_MCP_USE_PROXY=false
SCRAPY_MCP_PROXY_URL=

# 重试设置
SCRAPY_MCP_MAX_RETRIES=3
SCRAPY_MCP_REQUEST_TIMEOUT=30

🚦 快速开始

启动服务器

# 使用命令行
scrapy-mcp

# 使用 uv 运行（推荐）
uv run scrapy-mcp

# 或者使用Python
python -m scrapy_mcp.server

# 使用 uv 运行 Python 模块
uv run python -m scrapy_mcp.server

MCP Client 配置

在您的 MCP client (如 Claude Desktop) 中添加服务器配置：

方式一：直接命令方式

{
  "mcpServers": {
    "scrapy-mcp": {
      "command": "scrapy-mcp",
      "args": []
    }
  }
}

方式二：通过 uv 启动（推荐）

{
  "mcpServers": {
    "scrapy-mcp": {
      "command": "uv",
      "args": ["run", "scrapy-mcp"],
      "cwd": "/Users/cm.huang/Documents/workspace/projects/aurelius/scrapy-mcp"
    }
  }
}

方式三：从 GitHub 仓库直接安装和运行

{
  "mcpServers": {
    "scrapy-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "git+https://github.com/ThreeFish-AI/scrapy-mcp.git",
        "scrapy-mcp"
      ]
    }
  }
}

方式四：指定 Git Tag 版本（推荐用于生产环境）

通过指定 git tag 可以确保使用特定版本，提高稳定性：

{
  "mcpServers": {
    "scrapy-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "git+https://github.com/ThreeFish-AI/scrapy-mcp.git@v0.1.0",
        "scrapy-mcp"
      ]
    }
  }
}

也可以指定其他标签或分支：

{
  "mcpServers": {
    "scrapy-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "git+https://github.com/ThreeFish-AI/scrapy-mcp.git@main",
        "scrapy-mcp"
      ]
    }
  }
}

方式五：Python 模块方式（使用 uv）

{
  "mcpServers": {
    "scrapy-mcp": {
      "command": "uv",
      "args": ["run", "python", "-m", "scrapy_mcp.server"],
      "cwd": "/Users/cm.huang/Documents/workspace/projects/aurelius/scrapy-mcp"
    }
  }
}

注意事项：

将路径替换为您的项目实际路径（示例路径：/Users/cm.huang/Documents/workspace/projects/aurelius/scrapy-mcp）
GitHub 仓库地址：git@github.com:ThreeFish-AI/scrapy-mcp.git
HTTPS 仓库地址：https://github.com/ThreeFish-AI/scrapy-mcp.git
项目目录名：scrapy-mcp
推荐使用方式二（uv 启动），具有更好的依赖管理和性能
方式三适合直接从 GitHub 仓库运行，无需本地克隆

🛠️ 可用工具

1. scrape_webpage

基础网页爬取工具，支持多种方法和自定义提取规则。

参数:

url: 要爬取的 URL
method: 爬取方法 (auto/simple/scrapy/selenium)
extract_config: 数据提取配置 (可选)
wait_for_element: 等待的 CSS 选择器 (Selenium 专用)

示例:

{
  "url": "https://example.com",
  "method": "auto",
  "extract_config": {
    "title": "h1",
    "content": {
      "selector": ".content p",
      "multiple": true,
      "attr": "text"
    }
  }
}

2. scrape_multiple_webpages

并发爬取多个网页。

示例:

{
  "urls": ["https://example1.com", "https://example2.com"],
  "method": "simple",
  "extract_config": {
    "title": "h1",
    "links": "a"
  }
}

3. scrape_with_stealth

使用高级反检测技术爬取网页。

参数:

url: 目标 URL
method: 隐身方法 (selenium/playwright)
extract_config: 提取配置
wait_for_element: 等待元素
scroll_page: 是否滚动页面加载动态内容

示例:

{
  "url": "https://protected-site.com",
  "method": "playwright",
  "scroll_page": true,
  "wait_for_element": ".dynamic-content"
}

4. fill_and_submit_form

表单填写和提交。

参数:

url: 包含表单的页面 URL
form_data: 表单字段数据 (选择器:值对)
submit: 是否提交表单
submit_button_selector: 提交按钮选择器
method: 方法 (selenium/playwright)

示例:

{
  "url": "https://example.com/contact",
  "form_data": {
    "input[name='name']": "John Doe",
    "input[name='email']": "john@example.com",
    "textarea[name='message']": "Hello world"
  },
  "submit": true,
  "method": "selenium"
}

5. extract_links

专门的链接提取工具。

参数:

url: 目标 URL
filter_domains: 只包含这些域名的链接
exclude_domains: 排除这些域名的链接
internal_only: 只提取内部链接

示例:

{
  "url": "https://example.com",
  "internal_only": true
}

6. extract_structured_data

自动提取结构化数据 (联系信息、社交媒体链接等)。

参数:

url: 目标 URL
data_type: 数据类型 (all/contact/social/content)

示例:

{
  "url": "https://company.com",
  "data_type": "contact"
}

7. get_page_info

快速获取页面基础信息。

示例:

{
  "url": "https://example.com"
}

8. check_robots_txt

检查网站的 robots.txt 文件。

9. get_server_metrics

获取服务器性能指标和统计信息。

10. clear_cache

清除缓存的爬取结果。

📖 数据提取配置

简单选择器

{
  "title": "h1",
  "links": "a"
}

高级配置

{
  "products": {
    "selector": ".product",
    "multiple": true,
    "attr": "text"
  },
  "prices": {
    "selector": ".price",
    "multiple": true,
    "attr": "data-price"
  },
  "description": {
    "selector": ".description",
    "multiple": false,
    "attr": "text"
  }
}

🎯 最佳实践

1. 选择合适的方法

simple: 静态内容，快速爬取
scrapy: 大规模爬取，需要高级特性
selenium: JavaScript 重度网站
stealth: 有反爬保护的网站

2. 遵守网站规则

使用 check_robots_txt 工具检查爬取规则
设置合适的延迟和并发限制
尊重网站的使用条款

3. 性能优化

使用缓存避免重复请求
合理设置超时时间
监控 get_server_metrics 调整配置

4. 错误处理

实施重试逻辑
监控错误类别
根据错误类型调整策略

🔍 故障排除

常见问题

1. Selenium/Playwright 启动失败

确保安装了 Chrome 浏览器
检查系统权限和防火墙设置

2. 反爬虫检测

使用 scrape_with_stealth 工具
启用随机 User-Agent
配置代理服务器

3. 超时错误

增加 browser_timeout 设置
检查网络连接
使用更稳定的爬取方法

4. 内存占用过高

减少并发请求数
清理缓存
检查是否有资源泄露

📊 性能指标

使用 get_server_metrics 工具监控：

请求总数和成功率
平均响应时间
错误分类统计
方法使用分布
缓存命中率

🔒 安全注意事项

不要在日志中记录敏感信息
使用 HTTPS 代理服务器
定期更新依赖包
遵守数据保护法规

🤝 贡献

欢迎提交 Issue 和 Pull Request 来改进这个项目。

📄 许可证

MIT License - 详见 LICENSE 文件

📞 支持

如遇问题请提交 GitHub Issue 或联系维护团队。

注意: 请负责任地使用此工具，遵守网站的使用条款和 robots.txt 规则，尊重网站的知识产权。

Recommended Servers

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official

Featured

TypeScript

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official

Featured

Local

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official

Featured

Python

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official

Featured

TypeScript

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official

Featured

Neon Database

MCP server for interacting with Neon Management API and databases

Official

Featured

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official

Featured

E2B

Using MCP to run code via e2b.

Official

Featured