WebScraper MCP Server
An MCP server for advanced web scraping with automatic JavaScript rendering support using Playwright. It enables scraping pages, extracting links and images, and capturing screenshots.
WebScraper MCP Server v2.0 (Playwright Edition)
A Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.
What's New in v2.0
- Microsoft Playwright - Superior JavaScript rendering with automatic fallback
- Smart Detection - Automatically switches to JS rendering when needed
- Screenshots - Capture page screenshots as base64
- Custom Waits - Wait for specific selectors or time periods
- Dual Mode - Static scraping for speed, JS rendering for dynamic content
- Performance Metrics - Track load times and render methods
Features
Core Capabilities
- Intelligent Web Scraping: Automatic detection of static vs dynamic pages
- HTML to Markdown: Clean, well-formatted Markdown conversion
- JavaScript Rendering: Full Playwright support for SPA and dynamic content
- Link Extraction: Extract all hyperlinks with filtering options
- Image Extraction: Extract images including lazy-loaded ones
- Batch Processing: Scrape up to 10 URLs simultaneously
- Metadata Extraction: Title, description, author, keywords, and more
- Flexible Options: Control timeouts, redirects, content inclusion
- Multiple Formats: Output in Markdown or JSON
- Screenshot Capture: Get base64 screenshots of pages
Rendering Modes
- Static Mode (Default, Fast)
  - Uses Axios + Cheerio
  - Suitable for traditional HTML pages
  - Fastest performance
- JavaScript Mode (Auto-detected or Forced)
  - Uses Playwright with Chromium
  - Executes JavaScript
  - Handles SPAs, lazy loading, dynamic content
  - Auto-activates when static mode returns < 50 words
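The fallback rule above (switch to JavaScript mode when the static pass yields fewer than 50 words) can be sketched as a small decision function. This is only an illustration: `countWords`, `chooseRenderMethod`, and `MIN_STATIC_WORDS` are hypothetical names, and the server's actual word-counting logic may differ.

```typescript
// Hypothetical names; only the 50-word threshold comes from this README.
const MIN_STATIC_WORDS = 50;

type RenderMethod = "static" | "javascript";

/** Count whitespace-separated words in the text extracted by the static pass. */
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

/**
 * Decide which render method to use.
 * forceJavascript mirrors the use_javascript tool parameter.
 */
function chooseRenderMethod(staticText: string, forceJavascript = false): RenderMethod {
  if (forceJavascript) return "javascript";
  return countWords(staticText) < MIN_STATIC_WORDS ? "javascript" : "static";
}
```

With this shape, a near-empty SPA shell such as `<div id="root"></div>` produces almost no text in the static pass and would be routed to Playwright.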
Installation
# Clone or navigate to the project
cd webscraper-mcp-server-v2
# Install dependencies
npm install
# Install Playwright browsers
npm run install:browsers
# Build the project
npm run build
Usage
Running with stdio (Local)
npm start
Running with HTTP (Remote)
TRANSPORT=http PORT=3000 npm start
Available Tools
1. webscraper_scrape_page - Advanced Web Scraping
Automatically detects and handles both static and dynamic pages.
New Parameters:
- use_javascript (boolean): Force JavaScript rendering
- wait_for_selector (string): CSS selector to wait for
- wait_time (number): Additional wait time in milliseconds
- take_screenshot (boolean): Capture page screenshot
Example - Force JavaScript Rendering:
{
"url": "https://docs.uazapi.com/endpoint/post/instance~init",
"use_javascript": true,
"wait_for_selector": ".content",
"wait_time": 3000,
"take_screenshot": true
}
Example - Auto-Detection:
{
"url": "https://example.com/spa-app"
}
When use_javascript is omitted, the server switches to JavaScript rendering automatically if static scraping returns too little content (fewer than 50 words).
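As a rough sketch of how the new parameters might drive a Playwright page, the function below applies them in order: navigate, wait for the selector, wait the extra time, then capture. `PageLike` is a minimal stand-in for the handful of Playwright `Page` methods used (`goto`, `waitForSelector`, `waitForTimeout`, `content`, `screenshot`). `renderWithJavascript`, `RenderedPage`, and the 30-second default timeout are assumptions for illustration, not the server's actual code.

```typescript
// Minimal stand-in for the subset of Playwright's Page API used here.
interface PageLike {
  goto(url: string, options?: { timeout?: number }): Promise<unknown>;
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
  waitForTimeout(ms: number): Promise<void>;
  content(): Promise<string>;
  screenshot(): Promise<{ toString(encoding: string): string }>;
}

// Parameter names follow the tool inputs documented in this README.
interface ScrapeParams {
  url: string;
  wait_for_selector?: string;
  wait_time?: number; // milliseconds
  take_screenshot?: boolean;
  timeout?: number; // milliseconds
}

interface RenderedPage {
  html: string;
  screenshot?: string; // base64, present only when requested
}

async function renderWithJavascript(page: PageLike, params: ScrapeParams): Promise<RenderedPage> {
  const timeout = params.timeout ?? 30000; // assumed default, not from the README
  await page.goto(params.url, { timeout });
  // Prefer waiting for a concrete element over sleeping for a fixed time.
  if (params.wait_for_selector) {
    await page.waitForSelector(params.wait_for_selector, { timeout });
  }
  // Optional extra delay for content that loads after the selector appears.
  if (params.wait_time && params.wait_time > 0) {
    await page.waitForTimeout(params.wait_time);
  }
  const result: RenderedPage = { html: await page.content() };
  if (params.take_screenshot) {
    result.screenshot = (await page.screenshot()).toString("base64");
  }
  return result;
}
```

Taking the page as a parameter keeps the sketch testable with a fake object; in a real server the page would come from a Playwright browser context.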
2. webscraper_extract_links - Link Extraction
New Parameter:
- use_javascript (boolean): Use JavaScript rendering for dynamic links
Example:
{
"url": "https://example.com",
"use_javascript": true,
"filter_external": true
}
3. webscraper_extract_images - Image Extraction
New Parameter:
- use_javascript (boolean): Extract lazy-loaded images
Example:
{
"url": "https://example.com/gallery",
"use_javascript": true,
"limit": 50
}
4. webscraper_batch_scrape - Batch Operations
New Parameter:
- use_javascript (boolean): Use JavaScript for all URLs
Example:
{
"urls": ["https://page1.com", "https://page2.com"],
"use_javascript": true,
"timeout": 60000
}
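One way a batch operation like this could be structured is to scrape all URLs concurrently and record a per-URL outcome, so that a single failing page does not abort the rest. The sketch below assumes exactly that behavior. `batchScrape`, `ScrapeFn`, and `BatchItem` are hypothetical names, and the real tool's result shape and error handling may differ. Only the 10-URL limit comes from this README.

```typescript
// Hypothetical types; the real tool's result shape may differ.
type ScrapeFn = (url: string) => Promise<string>;

interface BatchItem {
  url: string;
  ok: boolean;
  content?: string;
  error?: string;
}

const MAX_BATCH_URLS = 10; // limit stated in this README

/** Scrape all URLs concurrently; one failure does not abort the others. */
async function batchScrape(urls: string[], scrape: ScrapeFn): Promise<BatchItem[]> {
  if (urls.length === 0 || urls.length > MAX_BATCH_URLS) {
    throw new Error(`Expected 1-${MAX_BATCH_URLS} URLs, got ${urls.length}`);
  }
  return Promise.all(
    urls.map(async (url): Promise<BatchItem> => {
      try {
        return { url, ok: true, content: await scrape(url) };
      } catch (err) {
        // Convert the failure into data so the other results still come back.
        const message = err instanceof Error ? err.message : String(err);
        return { url, ok: false, error: message };
      }
    })
  );
}
```

Results come back in the same order as the input URLs, which makes it easy to match each outcome to its request.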
Configuration
Environment Variables
- TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio')
- PORT: HTTP server port (default: 3000, only for HTTP transport)
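A minimal sketch of how these two variables could be resolved at startup is shown below. The defaults are the ones documented above. The function name `resolveConfig`, the case-insensitive matching, and the choice to throw on invalid values are assumptions made for illustration.

```typescript
type Transport = "stdio" | "http";

interface ServerConfig {
  transport: Transport;
  port?: number; // only set for the HTTP transport
}

/** Resolve TRANSPORT and PORT from an environment-like object. */
function resolveConfig(env: Record<string, string | undefined>): ServerConfig {
  const raw = (env.TRANSPORT ?? "stdio").toLowerCase(); // documented default: stdio
  if (raw !== "stdio" && raw !== "http") {
    throw new Error(`Unsupported TRANSPORT: ${raw}`);
  }
  if (raw === "stdio") return { transport: "stdio" };

  const port = Number(env.PORT ?? "3000"); // documented default: 3000
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return { transport: "http", port };
}
```

In a Node.js entry point this would typically be called as `resolveConfig(process.env)`.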
Client Configuration (Claude Desktop)
{
"mcpServers": {
"webscraper": {
"command": "node",
"args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"]
}
}
}
Output Formats
Markdown Format (Enhanced)
# Page Title
**URL:** https://example.com
**Render Method:** javascript
**Description:** Page description
**Author:** Author Name
**Word Count:** 1500 | **Status:** 200 | **Load Time:** 2340ms
---
[Page content in Markdown...]
JSON Format (Enhanced)
{
"url": "https://example.com",
"title": "Page Title",
"content": "Markdown content...",
"renderMethod": "javascript",
"metadata": {
"description": "Page description",
"wordCount": 1500,
"loadTime": 2340,
"screenshot": "base64..."
}
}
The screenshot field is included only when a screenshot was requested.
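The two formats carry the same data, so the Markdown header can be derived from the JSON structure. The sketch below types the JSON example and renders the Markdown layout shown earlier. `ScrapeResult` and `formatAsMarkdown` are hypothetical names. The `author` and `status` fields appear only in the Markdown example, so where they live in the JSON is an assumption.

```typescript
// Field names follow the JSON example above; `author` and `status` appear
// only in the Markdown example, so their placement here is an assumption.
interface ScrapeResult {
  url: string;
  title: string;
  content: string; // page body, already converted to Markdown
  renderMethod: "static" | "javascript";
  status: number;
  metadata: {
    description?: string;
    author?: string;
    wordCount: number;
    loadTime: number; // milliseconds
    screenshot?: string; // base64, only if requested
  };
}

/** Render a result in the Markdown layout shown in this README. */
function formatAsMarkdown(r: ScrapeResult): string {
  const m = r.metadata;
  const lines: string[] = [
    `# ${r.title}`,
    `**URL:** ${r.url}`,
    `**Render Method:** ${r.renderMethod}`,
  ];
  // Optional metadata lines are skipped when the page does not provide them.
  if (m.description) lines.push(`**Description:** ${m.description}`);
  if (m.author) lines.push(`**Author:** ${m.author}`);
  lines.push(
    `**Word Count:** ${m.wordCount} | **Status:** ${r.status} | **Load Time:** ${m.loadTime}ms`,
    "---",
    r.content
  );
  return lines.join("\n");
}
```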
Performance Comparison
| Feature | Static Mode | JavaScript Mode |
|---|---|---|
| Speed | ~1-3s | ~3-8s |
| JavaScript | ❌ | ✅ |
| SPA Support | ❌ | ✅ |
| Lazy Loading | ❌ | ✅ |
| Resource Usage | Low | Medium |
| Best For | Traditional HTML | Modern Web Apps |
Use Cases
1. Scraping JavaScript-Heavy Sites
// Site with React/Vue/Angular
{
"url": "https://spa-site.com",
"use_javascript": true,
"wait_for_selector": "#root > div",
"wait_time": 2000
}
2. Capturing Visual State
// Get screenshot along with content
{
"url": "https://example.com/dashboard",
"use_javascript": true,
"take_screenshot": true
}
3. API Documentation Sites
// For example, the UAZ API docs
{
"url": "https://docs.uazapi.com/endpoint/post/instance~init",
"use_javascript": true,
"wait_for_selector": ".api-content",
"response_format": "json"
}
4. E-commerce Product Pages
// Lazy-loaded images and dynamic prices
{
"url": "https://shop.example.com/product/123",
"use_javascript": true,
"wait_time": 3000
}
Troubleshooting
Playwright Issues
# Reinstall browsers
npm run install:browsers
# Check Playwright installation
npx playwright --version
Low Word Count on Dynamic Sites
Problem: Getting < 50 words from a JavaScript site?
Solution:
- Set use_javascript: true explicitly
- Use wait_for_selector for specific elements
- Increase wait_time if content loads slowly
Memory Issues
Problem: Browser consuming too much memory?
Solution:
- The browser instance is reused and shared
- Contexts are closed after each operation
- Consider increasing system resources for heavy usage
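The pattern described above (one shared browser, a fresh context per operation that is always closed) can be sketched as follows. `BrowserLike` and `ContextLike` are minimal stand-ins for the parts of Playwright's `Browser` and `BrowserContext` that the pattern needs, and `BrowserPool` is a hypothetical name. The server's actual implementation is not shown in this README and may differ.

```typescript
// Minimal stand-ins for the Playwright objects this pattern relies on.
interface ContextLike {
  close(): Promise<void>;
}
interface BrowserLike<C extends ContextLike> {
  newContext(): Promise<C>;
}

class BrowserPool<C extends ContextLike> {
  private launch: () => Promise<BrowserLike<C>>;
  private browser: Promise<BrowserLike<C>> | null = null;

  constructor(launch: () => Promise<BrowserLike<C>>) {
    this.launch = launch;
  }

  /** Launch the browser on first use, then reuse the same instance. */
  private getBrowser(): Promise<BrowserLike<C>> {
    if (!this.browser) this.browser = this.launch();
    return this.browser;
  }

  /** Run a task in a fresh context and close it afterwards, even on error. */
  async withContext<T>(task: (context: C) => Promise<T>): Promise<T> {
    const browser = await this.getBrowser();
    const context = await browser.newContext();
    try {
      return await task(context);
    } finally {
      await context.close(); // releases pages, cookies, and cache for this operation
    }
  }
}
```

With Playwright installed, the pool would presumably be created as `new BrowserPool(() => chromium.launch())`. Closing each context in `finally` is what keeps memory from growing across many scrapes.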
Advantages over Puppeteer
- Better Performance: Playwright is generally faster
- More Reliable: Better handling of modern web apps
- Auto-waiting: Smarter element waiting
- Multiple Browsers: Can use Chromium, Firefox, or WebKit
- Modern APIs: Cleaner, more intuitive API
- Active Development: Microsoft-backed, frequent updates
Development
Project Structure
webscraper-mcp-server-v2/
├── src/
│   ├── index.ts          # Main entry point
│   ├── types.ts          # TypeScript definitions (enhanced)
│   ├── constants.ts      # Configuration constants
│   ├── schemas/          # Zod validation (updated)
│   ├── services/         # Playwright-based scraping
│   └── tools/            # MCP tool implementations
├── dist/                 # Compiled JavaScript
├── package.json          # Dependencies (with Playwright)
└── README.md
Building
npm run build
Testing
# With MCP Inspector
npx @modelcontextprotocol/inspector node dist/index.js
Limitations
- Maximum 10 URLs for batch scraping
- Content truncated at 100,000 characters
- Request timeout: 1-120 seconds
- Chromium browser required (~170MB download)
- Supports only HTTP/HTTPS protocols
- Requires publicly accessible URLs
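Several of these limits are simple input or output checks. The sketch below shows how they might be enforced. The constants mirror the list above, but the helper names are hypothetical, and treating the timeout as milliseconds is an assumption based on the `"timeout": 60000` example.

```typescript
// Limits taken from the list above; the millisecond unit is an assumption.
const MAX_CONTENT_CHARS = 100000;
const MIN_TIMEOUT_MS = 1000;
const MAX_TIMEOUT_MS = 120000;

/** Accept only http:// and https:// URLs. */
function isSupportedUrl(url: string): boolean {
  return /^https?:\/\//i.test(url);
}

/** Check that a timeout falls inside the documented 1-120 second range. */
function isValidTimeout(ms: number): boolean {
  return ms >= MIN_TIMEOUT_MS && ms <= MAX_TIMEOUT_MS;
}

/** Cut content at the character limit and report whether anything was dropped. */
function truncateContent(content: string): { text: string; truncated: boolean } {
  if (content.length <= MAX_CONTENT_CHARS) {
    return { text: content, truncated: false };
  }
  // slice() counts UTF-16 code units, which is close to, but not exactly, "characters".
  return { text: content.slice(0, MAX_CONTENT_CHARS), truncated: true };
}
```

Returning a `truncated` flag lets a caller tell a naturally short page apart from one that was cut off.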
Performance Tips
- Use Static Mode When Possible: 3-5x faster for traditional sites
- Batch Related URLs: More efficient than individual calls
- Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
- Use Selectors Wisely: Wait for specific elements instead of fixed times
- Limit Screenshot Usage: Screenshots increase response size significantly
Comparison with v1.0
| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
|---|---|---|
| Static HTML | ✅ Fast | ✅ Fast |
| JavaScript | ❌ | ✅ Full Support |
| Auto-Detection | ❌ | ✅ Smart Fallback |
| Screenshots | ❌ | ✅ Base64 Output |
| Lazy Loading | ❌ | ✅ Supported |
| SPAs | ❌ Limited | ✅ Full Support |
License
MIT
Contributing
Contributions welcome! Areas for improvement:
- [ ] Support for other Playwright browsers (Firefox, WebKit)
- [ ] PDF generation from pages
- [ ] Advanced selector strategies
- [ ] Request interception for blocking ads
- [ ] Cookie management
- [ ] Proxy support
Support
For issues or questions, please open an issue on the GitHub repository.
Made with ❤️ using Microsoft Playwright and Model Context Protocol