Portable MCP Web Scraper

Efficient web scraping MCP server with a two-step workflow for token-efficient content extraction. Supports previewing HTML structure, targeted scraping, and crawling entire documentation sites.

🚀 Portable MCP Web Scraper

A drop-in MCP (Model Context Protocol) server for efficient web scraping, built around a two-step workflow: preview the page structure first, then scrape only the parts you need.

✨ Features

Two-Step Workflow: Get HTML structure preview first, then scrape with targeted filters
Token Efficient: Minimal token usage for AI analysis and decision-making
Clean Output: Automatically removes navigation, ads, and UI elements
Portable: Drop anywhere and add to your MCP servers
Multiple Tools: Single page, multi-page, and documentation site scraping

🚀 Quick Start

Option 1: Automated Installation

```
python install.py
```

Option 2: Manual Setup

Install dependencies:
```
pip install -r requirements.txt
```

Copy to your MCP servers directory:

```
cp portable_mcp_scraper.py /path/to/your/mcp/servers/
```

Add to your MCP configuration (e.g., ~/.cursor/mcp.json):

```
{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/your/mcp/servers/portable_mcp_scraper.py"]
    }
  }
}
```

Restart your MCP client so it loads the new server.
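If you already have other servers configured, editing the JSON by hand risks overwriting them. The following is a minimal sketch of merging the entry programmatically instead. The helper name `add_server_entry` is hypothetical and not part of this package; it assumes the config file uses the `mcpServers` layout shown above.

```python
import json
from pathlib import Path


def add_server_entry(config_path, name, script_path, command="python"):
    """Add or update one entry under "mcpServers", keeping existing entries.

    Hypothetical helper for illustration; not shipped with this package.
    """
    path = Path(config_path).expanduser()
    # Start from the existing config if the file has content, else from {}.
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": [str(script_path)]}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `add_server_entry("~/.cursor/mcp.json", "web-scraper", "/path/to/your/mcp/servers/portable_mcp_scraper.py")` would write the same entry as the manual step while leaving other servers in place.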

🛠️ Available Tools

1. `preview_html_structure`

Get a clean HTML structure preview for AI analysis.

Parameters:

url (string): The URL to analyze
max_elements (int, optional): Maximum elements to include (default: 50)

Returns: Structured HTML preview with minimal text content
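To give a feel for what a structure preview with minimal text might look like, here is a rough sketch using only the standard library. It is not the server's actual implementation (which is not shown in this README); the class and function names are made up for illustration, and the output format is an assumption.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureOutline(HTMLParser):
    """Collect an indented outline of opening tags, capped at max_elements."""

    def __init__(self, max_elements=50):
        super().__init__()
        self.max_elements = max_elements
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if len(self.lines) < self.max_elements:
            attr_map = dict(attrs)
            label = tag
            if attr_map.get("id"):
                label += "#" + attr_map["id"]
            if attr_map.get("class"):
                label += "." + ".".join(attr_map["class"].split())
            self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def outline_html(html, max_elements=50):
    """Return the tag outline of an HTML string, dropping all text content."""
    parser = StructureOutline(max_elements)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

The outline keeps tag names, ids, and classes, which is usually enough to decide which tags to include or exclude in the next step. Note that this sketch does not repair unclosed tags, so indentation can drift on malformed HTML.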

2. `scrape_web_content`

Scrape web content with custom filtering.

Parameters:

url (string): The URL to scrape
include_tags (list, optional): HTML tags to include
exclude_tags (list, optional): HTML tags to exclude
save_to_file (bool, optional): Save content to file (default: false)
output_dir (string, optional): Directory to save files (default: "./scraped_content")

Returns: Clean Markdown content
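The interaction between `include_tags` and `exclude_tags` is easiest to see in code. The sketch below is one plausible interpretation, assuming that excluded tags win over included ones when they nest; the real server may behave differently, and it returns Markdown whereas this example only returns plain text. `TagFilter` and `filter_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep text inside include_tags, skipping anything nested in exclude_tags."""

    def __init__(self, include_tags, exclude_tags):
        super().__init__()
        self.include = set(include_tags)
        self.exclude = set(exclude_tags)
        self.include_depth = 0  # how many included ancestors are open
        self.exclude_depth = 0  # how many excluded ancestors are open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude:
            self.exclude_depth += 1
        elif tag in self.include:
            self.include_depth += 1

    def handle_endtag(self, tag):
        if tag in self.exclude and self.exclude_depth:
            self.exclude_depth -= 1
        elif tag in self.include and self.include_depth:
            self.include_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Assumption: exclusion takes priority over inclusion.
        if text and self.include_depth and not self.exclude_depth:
            self.chunks.append(text)


def filter_text(html, include_tags, exclude_tags=()):
    """Return the text of included elements, one chunk per line."""
    parser = TagFilter(include_tags, exclude_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

With this interpretation, a `<p>` inside a `<nav>` is dropped when `nav` is excluded, even though `p` is included.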

3. `scrape_documentation_site`

Scrape an entire documentation site with intelligent crawling.

Parameters:

base_url (string): Base URL of the documentation site
max_pages (int, optional): Maximum pages to scrape (default: 10)
include_tags (list, optional): HTML tags to include
exclude_tags (list, optional): HTML tags to exclude
save_to_files (bool, optional): Save each page to separate file (default: true)
output_dir (string, optional): Directory to save files (default: "./documentation")

Returns: Summary of scraped content with file paths
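The README does not describe how the crawler picks pages, so the following is only a sketch of one common approach: a breadth-first crawl that stays under `base_url` and stops at `max_pages`. The `fetch` callable is an assumption introduced here so the example runs without a browser; the names `LinkCollector` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(base_url, fetch, max_pages=10):
    """Breadth-first crawl of pages under base_url; fetch(url) returns HTML."""
    queue = deque([base_url])
    seen = {base_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" so each page is visited once.
            link = urldefrag(urljoin(url, href)).url
            # Simple prefix check; a stricter version would compare URL path segments.
            if link.startswith(base_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the crawl logic separate from how pages are loaded, so the same loop works with a plain HTTP client, a headless browser, or an in-memory dictionary in tests.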

💡 Two-Step Workflow Benefits

Step 1: Get HTML structure preview (~500-1000 tokens)
Step 2: Scrape with AI-determined filters (clean, focused content)

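Efficiency Gains:

For a 100-page documentation site: roughly 90% fewer tokens than sending raw pages (approximate; the exact figure depends on the site)
The AI analyzes only the page structure, not the full content
Clean, focused output without manual filtering
Lower cost on large documentation sites, since far fewer tokens are processed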
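To see where a reduction of this size could come from, here is a back-of-the-envelope calculation. The per-page token counts are made-up illustrative inputs, not measurements from this package, and the sketch assumes one structure preview is reused for the whole site.

```python
def estimate_token_savings(pages, raw_tokens_per_page, filtered_tokens_per_page,
                           preview_tokens):
    """Compare sending every raw page with one preview plus filtered pages."""
    raw_total = pages * raw_tokens_per_page
    two_step_total = preview_tokens + pages * filtered_tokens_per_page
    reduction = 1 - two_step_total / raw_total
    return raw_total, two_step_total, reduction


# Illustrative inputs only: 8,000 raw and 700 filtered tokens per page are
# hypothetical figures; 1,000 is the upper end of the preview range above.
raw, two_step, reduction = estimate_token_savings(
    pages=100,
    raw_tokens_per_page=8000,
    filtered_tokens_per_page=700,
    preview_tokens=1000,
)
print(f"{raw} -> {two_step} tokens ({reduction:.0%} reduction)")
```

Plug in token counts measured on your own target site to get a realistic estimate; sites with little navigation or boilerplate will see a smaller reduction.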

🧪 Testing

Run the test script to verify everything works:

```
python test_mcp_server.py
```

📋 Example Usage

Basic Single Page Scraping

```
# Get structure preview first
preview = preview_html_structure("https://cursor.com/docs")

# Then scrape with filters
content = scrape_web_content(
    url="https://cursor.com/docs",
    include_tags=["h1", "h2", "h3", "p", "div"],
    exclude_tags=["nav", "footer", "aside"],
    save_to_file=True
)
```

Documentation Site Scraping

```
# Scrape entire documentation site
summary = scrape_documentation_site(
    base_url="https://cursor.com/docs",
    max_pages=20,
    save_to_files=True,
    output_dir="./cursor_docs"
)
```
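The examples above are written as plain function calls for readability. In practice an MCP client invokes tools by sending JSON-RPC 2.0 `tools/call` requests to the server. The sketch below builds such a request for the documentation example; `build_tool_call` is a hypothetical helper, and your MCP client normally constructs this message for you.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "scrape_documentation_site",
    {
        "base_url": "https://cursor.com/docs",
        "max_pages": 20,
        "save_to_files": True,
        "output_dir": "./cursor_docs",
    },
)
print(json.dumps(request, indent=2))
```

Seeing the raw message can help when debugging: the tool name and argument keys must match the tool definitions listed under Available Tools.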

🔧 Requirements

Python 3.8+
Google Chrome browser
ChromeDriver (automatically managed by webdriver-manager)
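A quick way to check the first two requirements before installing is sketched below. The executable names in `CHROME_CANDIDATES` are assumptions (Chrome's binary name varies by platform and install method), and `check_environment` is a hypothetical helper, not part of this package.

```python
import shutil
import sys

# Assumed executable names; adjust for your platform.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chrome",
    "chromium",
    "chromium-browser",
)


def check_environment():
    """Report whether Python is new enough and where Chrome was found on PATH."""
    chrome_path = next(
        (path for path in map(shutil.which, CHROME_CANDIDATES) if path), None
    )
    return {
        "python_ok": sys.version_info >= (3, 8),
        "chrome_path": chrome_path,  # None if no candidate is on PATH
    }


print(check_environment())
```

A `chrome_path` of `None` does not prove Chrome is missing: on macOS and Windows the browser is usually installed outside `PATH`, so treat this as a hint rather than a definitive check.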

📁 Files

portable_mcp_scraper.py - The main MCP server
fastmcp.json - FastMCP configuration
install.py - Automated installation script
test_mcp_server.py - Test script
requirements.txt - Python dependencies
USAGE_GUIDE.md - 📚 Complete usage guide with examples
PORTABLE_PACKAGE.md - Detailed package information
SUCCESS_SUMMARY.md - What this package provides

🎯 Perfect For

AI Agents that need to scrape documentation on-demand
Documentation Analysis with minimal token usage
Content Extraction from complex websites
Multi-page Scraping with intelligent crawling
Cost-Effective web scraping for AI workflows

📄 License

MIT License - Feel free to use and modify as needed.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📞 Support

If you encounter any issues:

Check that Google Chrome is installed
Verify all dependencies are installed
Run the test script to diagnose problems
Check the documentation files for troubleshooting

Ready to drop and use! 🎉
