Portable MCP Web Scraper
Efficient web scraping MCP server with a two-step workflow for token-efficient content extraction. Supports previewing HTML structure, targeted scraping, and entire documentation site crawling.
Portable MCP Web Scraper
A drop-in MCP (Model Context Protocol) server for web scraping, built around a two-step workflow that keeps token usage low.
Features
- Two-Step Workflow: Get HTML structure preview first, then scrape with targeted filters
- Token Efficient: Minimal token usage for AI analysis and decision-making
- Clean Output: Automatically removes navigation, ads, and UI elements
- Portable: Drop anywhere and add to your MCP servers
- Multiple Tools: Single page, multi-page, and documentation site scraping
Quick Start
Option 1: Automated Installation
python install.py
Option 2: Manual Setup
1. Install dependencies:
   pip install -r requirements.txt
2. Copy to your MCP servers directory:
   cp portable_mcp_scraper.py /path/to/your/mcp/servers/
3. Add to your MCP configuration (e.g., ~/.cursor/mcp.json):
   {
     "mcpServers": {
       "web-scraper": {
         "command": "python",
         "args": ["/path/to/your/mcp/servers/portable_mcp_scraper.py"]
       }
     }
   }
4. Restart your MCP client
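The configuration step can also be scripted. The following is a minimal sketch, assuming the config file is plain JSON with a top-level "mcpServers" object like the one shown above. The helper name add_server is hypothetical and not part of this package; the bundled install.py may work differently.

```python
import json
from pathlib import Path


def add_server(config_path, name, script_path, command="python"):
    """Add or update one entry under "mcpServers" and return the new config.

    Hypothetical helper for illustration; not part of the package.
    """
    path = Path(config_path).expanduser()
    # Start from the existing config when there is one, so other servers survive.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": [str(script_path)]}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, add_server("~/.cursor/mcp.json", "web-scraper", "/path/to/your/mcp/servers/portable_mcp_scraper.py") would write the same entry as the manual step. Back up the config file first if it already lists other servers.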
Available Tools
1. preview_html_structure
Get a clean HTML structure preview for AI analysis.
Parameters:
- url (string): The URL to analyze
- max_elements (int, optional): Maximum elements to include (default: 50)
Returns: Structured HTML preview with minimal text content
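To give a feel for what a structure preview can look like, here is a rough sketch built on the standard library's html.parser. It is not the server's actual implementation: the class and function names are made up for illustration, and this version drops text content entirely and only outlines tags, ids, and classes, capped at max_elements.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the outline.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructurePreview(HTMLParser):
    """Collect an indented outline of tags, ids, and classes (illustrative)."""

    def __init__(self, max_elements=50):
        super().__init__()
        self.max_elements = max_elements
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if len(self.lines) < self.max_elements:
            attrs = dict(attrs)
            label = tag
            if attrs.get("id"):
                label += "#" + attrs["id"]
            if attrs.get("class"):
                label += "." + ".".join(attrs["class"].split())
            self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def preview_structure(html, max_elements=50):
    """Return the outline as one string, one element per line."""
    parser = StructurePreview(max_elements)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

An outline like this is usually enough to spot wrappers such as nav#top or main.content that are worth excluding or including in the second step. The sketch assumes reasonably well-formed HTML; unclosed tags will skew the indentation.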
2. scrape_web_content
Scrape web content with custom filtering.
Parameters:
- url (string): The URL to scrape
- include_tags (list, optional): HTML tags to include
- exclude_tags (list, optional): HTML tags to exclude
- save_to_file (bool, optional): Save content to file (default: false)
- output_dir (string, optional): Directory to save files (default: "./scraped_content")
Returns: Clean Markdown content
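The interaction between include_tags and exclude_tags can be sketched as follows. This is an assumption about the semantics, not the server's code: text is kept when it sits inside an included tag and not inside any excluded one, and exclusion wins on conflict. The names TagFilter and filter_text are hypothetical, and the sketch returns plain text rather than Markdown.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep text inside include_tags, skipping anything inside exclude_tags.

    Illustrative only; assumes the listed tags are non-void elements
    that are properly closed.
    """

    def __init__(self, include_tags, exclude_tags):
        super().__init__()
        self.include = set(include_tags)
        self.exclude = set(exclude_tags)
        self.include_depth = 0  # number of open included ancestors
        self.exclude_depth = 0  # number of open excluded ancestors
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude:
            self.exclude_depth += 1
        elif tag in self.include:
            self.include_depth += 1

    def handle_endtag(self, tag):
        if tag in self.exclude and self.exclude_depth:
            self.exclude_depth -= 1
        elif tag in self.include and self.include_depth:
            self.include_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.include_depth and not self.exclude_depth:
            self.chunks.append(text)


def filter_text(html, include_tags, exclude_tags=()):
    """Return the kept text chunks joined by newlines."""
    parser = TagFilter(include_tags, exclude_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

With this rule, a paragraph inside a footer is dropped even though "p" is included, which is the behavior you would typically want when stripping navigation and boilerplate.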
3. scrape_documentation_site
Scrape an entire documentation site with intelligent crawling.
Parameters:
- base_url (string): Base URL of the documentation site
- max_pages (int, optional): Maximum pages to scrape (default: 10)
- include_tags (list, optional): HTML tags to include
- exclude_tags (list, optional): HTML tags to exclude
- save_to_files (bool, optional): Save each page to separate file (default: true)
- output_dir (string, optional): Directory to save files (default: "./documentation")
Returns: Summary of scraped content with file paths
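The README does not spell out what "intelligent crawling" does internally. One plausible reading is a bounded breadth-first crawl that only follows links under base_url, sketched below. The function names are hypothetical, and fetch is an injected callable (any function that takes a URL and returns HTML), so the sketch makes no assumption about how the real server loads pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(base_url, fetch, max_pages=10):
    """Breadth-first crawl of pages under base_url; fetch(url) returns HTML."""
    queue = deque([base_url])
    seen = {base_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop #fragments to avoid duplicates.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(base_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The simple prefix check keeps the crawl inside the documentation tree, though it would also match sibling paths that merely share the prefix (for example /docs-old), so a stricter path comparison may be worth adding in practice.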
Two-Step Workflow Benefits
- Step 1: Get HTML structure preview (~500-1000 tokens)
- Step 2: Scrape with AI-determined filters (clean, focused content)
Efficiency Gains:
- For a 100-page documentation site: ~90% token reduction
- AI only analyzes structure, not full content
- Clean, focused output without manual filtering
- Lower token costs, with the largest savings on big documentation sites
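How close you get to the ~90% figure depends on how much of each page is navigation and markup. As a rough illustration, the arithmetic can be sketched like this. All numbers below are made-up inputs, not measurements from this project, and the model assumes a single structure preview is reused for the whole site.

```python
def token_reduction(pages, raw_tokens_per_page, filtered_tokens_per_page, preview_tokens):
    """Fraction of tokens saved by previewing once and scraping filtered pages."""
    baseline = pages * raw_tokens_per_page
    optimized = preview_tokens + pages * filtered_tokens_per_page
    return 1 - optimized / baseline


# Hypothetical inputs: 100 pages, 8,000 raw tokens each, 800 after filtering,
# plus one 1,000-token structure preview.
saving = token_reduction(100, 8000, 800, 1000)
print(f"{saving:.0%}")  # prints "90%" for these assumed inputs
```

Plugging in your own per-page token counts gives a quick estimate of whether the two-step approach pays off for a given site.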
Testing
Run the test script to verify everything works:
python test_mcp_server.py
Example Usage
Basic Single Page Scraping
# Get structure preview first
preview = preview_html_structure("https://cursor.com/docs")

# Then scrape with filters
content = scrape_web_content(
    url="https://cursor.com/docs",
    include_tags=["h1", "h2", "h3", "p", "div"],
    exclude_tags=["nav", "footer", "aside"],
    save_to_file=True
)
Documentation Site Scraping
# Scrape entire documentation site
summary = scrape_documentation_site(
    base_url="https://cursor.com/docs",
    max_pages=20,
    save_to_files=True,
    output_dir="./cursor_docs"
)
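When save_to_files is enabled, each page has to be mapped to a file name inside output_dir. The README does not say how the server names its files; the following is one possible scheme, with hypothetical helper names, that turns the host and path into a filesystem-safe slug.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url):
    """Turn a page URL into a safe .md file name (illustrative scheme)."""
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).strip("/")
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (slug or "index") + ".md"


def save_page(url, markdown, output_dir="./documentation"):
    """Write one page's Markdown into output_dir and return the file path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / url_to_filename(url)
    target.write_text(markdown, encoding="utf-8")
    return target
```

Under this scheme, https://cursor.com/docs/get-started would be saved as cursor_com_docs_get_started.md. Query strings are ignored, so two URLs that differ only in their query would overwrite each other.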
Requirements
- Python 3.8+
- Google Chrome browser
- ChromeDriver (automatically managed by webdriver-manager)
Files
- portable_mcp_scraper.py - The main MCP server
- fastmcp.json - FastMCP configuration
- install.py - Automated installation script
- test_mcp_server.py - Test script
- requirements.txt - Python dependencies
- USAGE_GUIDE.md - Complete usage guide with examples
- PORTABLE_PACKAGE.md - Detailed package information
- SUCCESS_SUMMARY.md - What this package provides
Perfect For
- AI Agents that need to scrape documentation on-demand
- Documentation Analysis with minimal token usage
- Content Extraction from complex websites
- Multi-page Scraping with intelligent crawling
- Cost-Effective web scraping for AI workflows
License
MIT License - Feel free to use and modify as needed.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
If you encounter any issues:
- Check that Google Chrome is installed
- Verify all dependencies are installed
- Run the test script to diagnose problems
- Check the documentation files for troubleshooting
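The first two checks can be automated with a small script. This is a sketch only: the module names selenium and webdriver_manager are an assumption based on the ChromeDriver requirement (check requirements.txt for the real list), and the browser executable names vary by platform.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("selenium", "webdriver_manager"),
                      browsers=("google-chrome", "chrome", "chromium")):
    """Report Python version, a Chrome executable on PATH, and missing modules.

    The default module and browser names are assumptions, not taken from
    the package's requirements.txt.
    """
    return {
        "python_ok": sys.version_info >= (3, 8),
        # First browser name found on PATH, or None.
        "browser": next((b for b in browsers if shutil.which(b)), None),
        # Top-level modules that cannot be located.
        "missing_modules": [m for m in modules
                            if importlib.util.find_spec(m) is None],
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Note that shutil.which only finds executables on PATH, so a None result for the browser is a hint rather than proof that Chrome is missing (on macOS and Windows it is often installed outside PATH).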
Ready to drop and use!