Portable MCP Web Scraper

Efficient web scraping MCP server with a two-step workflow for token-efficient content extraction. Supports previewing HTML structure, targeted scraping, and crawling entire documentation sites.


🚀 Portable MCP Web Scraper

A drop-in MCP (Model Context Protocol) server for efficient web scraping, built around a two-step workflow that keeps token usage low.

✨ Features

  • Two-Step Workflow: Get HTML structure preview first, then scrape with targeted filters
  • Token Efficient: The AI inspects a compact structure preview instead of the full page, so analysis and filter selection use few tokens
  • Clean Output: Automatically removes navigation, ads, and UI elements
  • Portable: Drop anywhere and add to your MCP servers
  • Multiple Tools: Single page, multi-page, and documentation site scraping

🚀 Quick Start

Option 1: Automated Installation

python install.py

Option 2: Manual Setup

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Copy to your MCP servers directory:

    cp portable_mcp_scraper.py /path/to/your/mcp/servers/
    
  3. Add to your MCP configuration (e.g., ~/.cursor/mcp.json):

    {
      "mcpServers": {
        "web-scraper": {
          "command": "python",
          "args": ["/path/to/your/mcp/servers/portable_mcp_scraper.py"]
        }
      }
    }
    
  4. Restart your MCP client
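
Step 3 can also be scripted. The following is a minimal sketch, not part of the package: the helper name `add_server_entry` is hypothetical, and it assumes the client reads a JSON file with a top-level "mcpServers" object, as in the example above. It keeps any existing server entries and adds or updates only the one it is given.

```python
import json
from pathlib import Path


def add_server_entry(config_path: Path, name: str, script_path: str) -> dict:
    """Insert or update one entry under "mcpServers" and write the file back.

    Hypothetical helper for illustration; existing entries are preserved.
    """
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8").strip()
        if text:
            config = json.loads(text)

    # Create the "mcpServers" object if the file did not have one yet.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": "python", "args": [script_path]}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the Cursor example above, the call would be `add_server_entry(Path.home() / ".cursor" / "mcp.json", "web-scraper", "/path/to/your/mcp/servers/portable_mcp_scraper.py")`.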

🛠️ Available Tools

1. preview_html_structure

Get a clean HTML structure preview for AI analysis.

Parameters:

  • url (string): The URL to analyze
  • max_elements (int, optional): Maximum elements to include (default: 50)

Returns: Structured HTML preview with minimal text content
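
To illustrate what a "structure preview with minimal text content" might look like, here is a rough sketch using only Python's standard library. It is not the server's actual implementation (which is not shown in this README); the class and function names are hypothetical, and the outline format (tag, `#id`, `.class`, indented by depth) is an assumption.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructurePreview(HTMLParser):
    """Collect an indented outline of tags with id/class, dropping all text."""

    def __init__(self, max_elements: int = 50):
        super().__init__()
        self.max_elements = max_elements
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if len(self.lines) < self.max_elements:
            attr_map = dict(attrs)
            label = tag
            if attr_map.get("id"):
                label += "#" + attr_map["id"]
            if attr_map.get("class"):
                label += "." + ".".join(attr_map["class"].split())
            self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def preview_structure(html: str, max_elements: int = 50) -> str:
    """Return the outline of the first max_elements tags in the document."""
    parser = StructurePreview(max_elements)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

Because text nodes are ignored entirely, the size of the preview depends on the number of elements rather than on the length of the page, which is what makes a `max_elements` cap useful. The sketch assumes reasonably well-formed HTML; unclosed tags would skew the indentation.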

2. scrape_web_content

Scrape web content with custom filtering.

Parameters:

  • url (string): The URL to scrape
  • include_tags (list, optional): HTML tags to include
  • exclude_tags (list, optional): HTML tags to exclude
  • save_to_file (bool, optional): Save content to file (default: false)
  • output_dir (string, optional): Directory to save files (default: "./scraped_content")

Returns: Clean Markdown content
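
The interplay of include_tags and exclude_tags can be sketched as follows. This is an illustrative approximation using the standard library, not the server's real filtering code: the names `TagFilter` and `filter_html` are hypothetical, the Markdown conversion is reduced to heading prefixes, and it assumes exclude_tags are container elements (such as nav or footer) with closing tags.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep text from included tags, skipping anything inside excluded tags."""

    def __init__(self, include_tags, exclude_tags):
        super().__init__()
        self.include = set(include_tags)
        self.exclude = set(exclude_tags)
        self.excluded_depth = 0  # > 0 while inside an excluded element
        self.current = None      # outermost included tag being captured
        self.nested = 0          # same-named tags nested inside self.current
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude:
            self.excluded_depth += 1
        elif self.excluded_depth:
            return
        elif self.current is None and tag in self.include:
            self.current, self.buffer = tag, []
        elif tag == self.current:
            self.nested += 1

    def handle_endtag(self, tag):
        if tag in self.exclude:
            self.excluded_depth = max(0, self.excluded_depth - 1)
        elif self.excluded_depth == 0 and tag == self.current:
            if self.nested:
                self.nested -= 1
                return
            text = " ".join("".join(self.buffer).split())
            if text:
                is_heading = len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"
                prefix = "#" * int(tag[1]) + " " if is_heading else ""
                self.blocks.append(prefix + text)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.excluded_depth == 0:
            self.buffer.append(data)


def filter_html(html, include_tags, exclude_tags=()):
    """Return the kept blocks as simplified Markdown, one block per paragraph."""
    parser = TagFilter(include_tags, exclude_tags)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

In this sketch exclusion wins over inclusion: a `p` inside an excluded `nav` is dropped even though `p` is in include_tags. Whether the real tool resolves conflicts the same way is not stated in this README.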

3. scrape_documentation_site

Scrape an entire documentation site with intelligent crawling.

Parameters:

  • base_url (string): Base URL of the documentation site
  • max_pages (int, optional): Maximum pages to scrape (default: 10)
  • include_tags (list, optional): HTML tags to include
  • exclude_tags (list, optional): HTML tags to exclude
  • save_to_files (bool, optional): Save each page to separate file (default: true)
  • output_dir (string, optional): Directory to save files (default: "./documentation")

Returns: Summary of scraped content with file paths
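
The README does not describe how the "intelligent crawling" works, so the following is only a plausible baseline: a breadth-first crawl that follows links under base_url and stops at max_pages. The `crawl` function and `LinkCollector` class are hypothetical, and page fetching is passed in as a `fetch` callable so the sketch runs without a browser or network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(base_url, fetch, max_pages=10):
    """Visit pages breadth-first, staying under base_url, up to max_pages.

    fetch is any callable mapping a URL to its HTML (an assumption made
    here so the example does not depend on Chrome or the network).
    """
    queue = deque([base_url])
    seen = {base_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop #fragments to avoid duplicates.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(base_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Restricting the crawl to URLs that start with base_url keeps it inside the documentation section of a site, and the `seen` set prevents revisiting pages that link to each other.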

💡 Two-Step Workflow Benefits

  1. Step 1: Get HTML structure preview (~500-1000 tokens)
  2. Step 2: Scrape with AI-determined filters (clean, focused content)

Efficiency Gains:

  • For a 100-page documentation site: ~90% token reduction
  • The AI analyzes only the structure, not the full content
  • Clean, focused output without manual filtering
  • Lower cost for large documentation sites, since cost scales with the tokens processed
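
As a back-of-the-envelope check on how a figure near 90% could arise, the arithmetic can be written out. Every input below is a hypothetical illustration, not a measurement: 10,000 tokens per raw page and 1,000 tokens per filtered page are invented for the example, and 750 is simply the midpoint of the 500-1000 token preview range quoted above. It also assumes a single preview is enough to choose filters for the whole site.

```python
def token_reduction(pages, raw_tokens_per_page, preview_tokens, filtered_tokens_per_page):
    """Compare reading every raw page against one preview plus filtered pages."""
    baseline = pages * raw_tokens_per_page
    two_step = preview_tokens + pages * filtered_tokens_per_page
    reduction = 1 - two_step / baseline
    return baseline, two_step, reduction


# Hypothetical inputs: 100 pages, 10,000 raw tokens per page,
# one 750-token preview, 1,000 filtered tokens per page.
baseline, two_step, reduction = token_reduction(100, 10_000, 750, 1_000)
print(f"baseline={baseline:,} two_step={two_step:,} reduction={reduction:.1%}")
# prints: baseline=1,000,000 two_step=100,750 reduction=89.9%
```

The result is dominated by the ratio of filtered to raw tokens per page; the one-off preview cost is negligible once the site has more than a few pages.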

🧪 Testing

Run the test script to verify everything works:

python test_mcp_server.py

📋 Example Usage

Basic Single Page Scraping

# Get structure preview first
preview = preview_html_structure("https://cursor.com/docs")

# Then scrape with filters
content = scrape_web_content(
    url="https://cursor.com/docs",
    include_tags=["h1", "h2", "h3", "p", "div"],
    exclude_tags=["nav", "footer", "aside"],
    save_to_file=True
)
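
The calls above are shorthand for tool invocations made by an MCP client rather than plain Python function calls. Under the Model Context Protocol, a client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for illustration; the helper name `tool_call_request` is hypothetical, and in practice the MCP client library constructs and sends this message for you.

```python
import json
from itertools import count

# Each JSON-RPC request needs an id so the response can be matched to it.
_request_ids = count(1)


def tool_call_request(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request for the given tool and arguments."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


request = tool_call_request(
    "scrape_web_content",
    {
        "url": "https://cursor.com/docs",
        "include_tags": ["h1", "h2", "h3", "p", "div"],
        "exclude_tags": ["nav", "footer", "aside"],
        "save_to_file": True,
    },
)
print(request)
```

The argument names are taken from the parameter lists in the Available Tools section.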

Documentation Site Scraping

# Scrape entire documentation site
summary = scrape_documentation_site(
    base_url="https://cursor.com/docs",
    max_pages=20,
    save_to_files=True,
    output_dir="./cursor_docs"
)

🔧 Requirements

  • Python 3.8+
  • Google Chrome browser
  • ChromeDriver (automatically managed by webdriver-manager)
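
A quick way to check the first two requirements before installing is sketched below. It is not part of the package: `check_requirements` is a hypothetical helper, and the list of Chrome executable names is an assumption that varies by platform and install method.

```python
import shutil
import sys

# Assumed executable names; Chrome may be installed under a different name
# or outside PATH (common on macOS and Windows).
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chrome",
    "chromium",
    "chromium-browser",
)


def check_requirements(min_python=(3, 8), candidates=CHROME_CANDIDATES):
    """Report whether Python is new enough and where Chrome was found on PATH."""
    chrome_path = next((p for p in map(shutil.which, candidates) if p), None)
    return {
        "python_ok": sys.version_info >= min_python,
        "chrome_path": chrome_path,
    }


print(check_requirements())
```

A `chrome_path` of `None` only means no candidate name was found on PATH; it does not prove Chrome is missing.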

📁 Files

  • portable_mcp_scraper.py - The main MCP server
  • fastmcp.json - FastMCP configuration
  • install.py - Automated installation script
  • test_mcp_server.py - Test script
  • requirements.txt - Python dependencies
  • USAGE_GUIDE.md - 📚 Complete usage guide with examples
  • PORTABLE_PACKAGE.md - Detailed package information
  • SUCCESS_SUMMARY.md - What this package provides

🎯 Perfect For

  • AI Agents that need to scrape documentation on-demand
  • Documentation Analysis with minimal token usage
  • Content Extraction from complex websites
  • Multi-page Scraping with intelligent crawling
  • Cost-Effective web scraping for AI workflows

📄 License

MIT License - Feel free to use and modify as needed.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📞 Support

If you encounter any issues:

  1. Check that Google Chrome is installed
  2. Verify all dependencies are installed
  3. Run the test script to diagnose problems
  4. Check the documentation files for troubleshooting

Ready to drop and use! 🎉
