WebScraper MCP Server

An MCP server for advanced web scraping with automatic JavaScript rendering support using Playwright. It enables scraping pages, extracting links and images, and capturing screenshots.

WebScraper MCP Server v2.0 (Playwright Edition)

A Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.

🆕 What's New in v2.0

  • 🚀 Microsoft Playwright - Superior JavaScript rendering with automatic fallback
  • 🎯 Smart Detection - Automatically switches to JS rendering when needed
  • 📸 Screenshots - Capture page screenshots as base64
  • ⏱️ Custom Waits - Wait for specific selectors or time periods
  • 🔄 Dual Mode - Static scraping for speed, JS rendering for dynamic content
  • 📊 Performance Metrics - Track load times and render methods

Features

Core Capabilities

  • 🌐 Intelligent Web Scraping: Automatic detection of static vs dynamic pages
  • 📝 HTML to Markdown: Clean, well-formatted Markdown conversion
  • 🎭 JavaScript Rendering: Full Playwright support for SPA and dynamic content
  • 🔗 Link Extraction: Extract all hyperlinks with filtering options
  • 🖼️ Image Extraction: Extract images including lazy-loaded ones
  • 📦 Batch Processing: Scrape up to 10 URLs simultaneously
  • 🎯 Metadata Extraction: Title, description, author, keywords, and more
  • ⚙️ Flexible Options: Control timeouts, redirects, content inclusion
  • 📊 Multiple Formats: Output in Markdown or JSON
  • 📸 Screenshot Capture: Get base64 screenshots of pages

Rendering Modes

  1. Static Mode (Default, Fast)

    • Uses Axios + Cheerio
    • Suitable for traditional HTML pages
    • Fastest performance
  2. JavaScript Mode (Auto-detected or Forced)

    • Uses Playwright with Chromium
    • Executes JavaScript
    • Handles SPAs, lazy loading, dynamic content
    • Auto-activates when static mode returns < 50 words
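
The auto-activation rule can be illustrated with a minimal sketch. Only the 50-word threshold comes from this README; the helper names `countWords` and `chooseRenderMethod` are hypothetical, and the server's actual word-counting logic may differ.

```typescript
// Hypothetical helpers; only the 50-word threshold is taken from this README.
const MIN_STATIC_WORDS = 50;

type RenderMethod = "static" | "javascript";

// Count whitespace-separated tokens in text that has already been extracted.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Pick the renderer: forced JavaScript wins, otherwise fall back to
// JavaScript only when the static result looks too thin.
function chooseRenderMethod(staticText: string, forceJavascript = false): RenderMethod {
  if (forceJavascript) return "javascript";
  return countWords(staticText) < MIN_STATIC_WORDS ? "javascript" : "static";
}
```

In this sketch a page whose static text has exactly 50 words stays in static mode, since the rule is stated as "< 50 words".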

Installation

# Clone or navigate to the project
cd webscraper-mcp-server-v2

# Install dependencies
npm install

# Install Playwright browsers
npm run install:browsers

# Build the project
npm run build

Usage

Running with stdio (Local)

npm start

Running with HTTP (Remote)

TRANSPORT=http PORT=3000 npm start

Available Tools

1. webscraper_scrape_page - Advanced Web Scraping

Automatically detects and handles both static and dynamic pages.

New Parameters:

  • use_javascript (boolean): Force JavaScript rendering
  • wait_for_selector (string): CSS selector to wait for
  • wait_time (number): Additional wait time in milliseconds
  • take_screenshot (boolean): Capture page screenshot

Example - Force JavaScript Rendering:

{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".content",
  "wait_time": 3000,
  "take_screenshot": true
}
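
As a rough sketch of how these parameters could map onto page operations, the function below navigates, waits for the selector, applies the extra delay, and optionally captures a screenshot. The `PageLike` interface is a hypothetical, trimmed-down stand-in modeled on the Playwright `Page` methods `goto`, `waitForSelector`, `waitForTimeout`, `content`, and `screenshot`; the order of the waits is an assumption, not something this README specifies.

```typescript
// Hypothetical minimal page surface, modeled on a few Playwright Page methods.
interface PageLike {
  goto(url: string): Promise<unknown>;
  waitForSelector(selector: string): Promise<unknown>;
  waitForTimeout(ms: number): Promise<void>;
  content(): Promise<string>;
  screenshot(): Promise<{ toString(encoding: "base64"): string }>;
}

// Parameter names follow the tool parameters listed above.
interface ScrapeOptions {
  url: string;
  wait_for_selector?: string;
  wait_time?: number; // milliseconds
  take_screenshot?: boolean;
}

interface RenderedPage {
  html: string;
  screenshot?: string; // base64, only when requested
}

async function renderPage(page: PageLike, opts: ScrapeOptions): Promise<RenderedPage> {
  await page.goto(opts.url);
  // Assumed order: selector wait first, then the additional fixed delay.
  if (opts.wait_for_selector) await page.waitForSelector(opts.wait_for_selector);
  if (opts.wait_time && opts.wait_time > 0) await page.waitForTimeout(opts.wait_time);
  const html = await page.content();
  if (!opts.take_screenshot) return { html };
  const image = await page.screenshot();
  return { html, screenshot: image.toString("base64") };
}
```

Keeping the page behind a small interface makes the wait logic testable without launching a browser.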

Example - Auto-Detection:

{
  "url": "https://example.com/spa-app"
}

If the static result is insufficient (fewer than 50 words), the server switches to JavaScript rendering automatically.

2. webscraper_extract_links - Link Extraction

New Parameter:

  • use_javascript (boolean): Use JavaScript rendering for dynamic links

Example:

{
  "url": "https://example.com",
  "use_javascript": true,
  "filter_external": true
}
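
This README does not say whether `filter_external` keeps or drops external links, so the sketch below only shows one plausible way to tell the two groups apart: resolve each href against the page URL and compare hostnames. The function name `classifyLinks` and the same-hostname rule are assumptions.

```typescript
interface ClassifiedLinks {
  internal: string[];
  external: string[];
}

// Hypothetical helper: split hrefs into same-host and other-host links.
function classifyLinks(pageUrl: string, hrefs: string[]): ClassifiedLinks {
  const base = new URL(pageUrl);
  const out: ClassifiedLinks = { internal: [], external: [] };
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, base); // resolves relative hrefs against the page URL
    } catch {
      continue; // skip malformed hrefs
    }
    // Ignore mailto:, javascript:, and other non-web schemes.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;
    const bucket = resolved.hostname === base.hostname ? out.internal : out.external;
    bucket.push(resolved.href);
  }
  return out;
}
```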

3. webscraper_extract_images - Image Extraction

New Parameter:

  • use_javascript (boolean): Extract lazy-loaded images

Example:

{
  "url": "https://example.com/gallery",
  "use_javascript": true,
  "limit": 50
}

4. webscraper_batch_scrape - Batch Operations

New Parameter:

  • use_javascript (boolean): Use JavaScript for all URLs

Example:

{
  "urls": ["https://page1.com", "https://page2.com"],
  "use_javascript": true,
  "timeout": 60000
}
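
A batch call of this kind can be sketched as a concurrent fan-out in which one failing URL does not abort the rest. The 10-URL cap comes from this README; `scrapeBatch`, `BatchItem`, and the injected `scrapeOne` function are hypothetical names, and the server's real error handling may differ.

```typescript
const MAX_BATCH_URLS = 10; // limit stated in this README

type BatchItem<T> =
  | { url: string; ok: true; value: T }
  | { url: string; ok: false; error: string };

// `scrapeOne` stands in for whatever single-page scraper is used.
async function scrapeBatch<T>(
  urls: string[],
  scrapeOne: (url: string) => Promise<T>,
): Promise<BatchItem<T>[]> {
  if (urls.length === 0 || urls.length > MAX_BATCH_URLS) {
    throw new Error(`Expected 1-${MAX_BATCH_URLS} URLs, got ${urls.length}`);
  }
  // Start every request at once; each failure is captured per URL
  // instead of rejecting the whole batch.
  return Promise.all(
    urls.map(async (url): Promise<BatchItem<T>> => {
      try {
        return { url, ok: true, value: await scrapeOne(url) };
      } catch (err) {
        return { url, ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }),
  );
}
```

Results come back in the same order as the input URLs, which keeps it easy to match each outcome to its request.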

Configuration

Environment Variables

  • TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio')
  • PORT: HTTP server port (default: 3000, only for HTTP transport)
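
A minimal sketch of how these two variables and their defaults could be resolved is shown below. The function name `resolveConfig`, the `ServerConfig` shape, and the choice to throw on invalid values are assumptions; the server may handle bad input differently.

```typescript
interface ServerConfig {
  transport: "stdio" | "http";
  port?: number; // only set for the HTTP transport
}

// `env` would typically be process.env; passing it in keeps the function testable.
function resolveConfig(env: Record<string, string | undefined>): ServerConfig {
  const raw = env["TRANSPORT"] ?? "stdio"; // documented default
  if (raw === "stdio") return { transport: "stdio" };
  if (raw !== "http") throw new Error(`Unsupported TRANSPORT: ${raw}`);

  const port = Number(env["PORT"] ?? "3000"); // documented default
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }
  return { transport: "http", port };
}
```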

Client Configuration (Claude Desktop)

{
  "mcpServers": {
    "webscraper": {
      "command": "node",
      "args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"]
    }
  }
}

Output Formats

Markdown Format (Enhanced)

# Page Title

**URL:** https://example.com
**Render Method:** javascript

**Description:** Page description
**Author:** Author Name
**Word Count:** 1500 | **Status:** 200 | **Load Time:** 2340ms

---

[Page content in Markdown...]

JSON Format (Enhanced)

{
  "url": "https://example.com",
  "title": "Page Title",
  "content": "Markdown content...",
  "renderMethod": "javascript",
  "metadata": {
    "description": "Page description",
    "wordCount": 1500,
    "loadTime": 2340,
    "screenshot": "base64..." // if requested
  }
}
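
To show how the two formats relate, here is a hedged sketch that types the JSON result and renders it into the Markdown layout shown earlier. The `ScrapeResult` interface covers only the fields that appear in the JSON example (so Author and Status are omitted), which fields are optional is an assumption, and `toMarkdown` is a hypothetical helper rather than the server's own formatter.

```typescript
// Field names mirror the JSON example; optionality is an assumption.
interface ScrapeResult {
  url: string;
  title: string;
  content: string;
  renderMethod: "static" | "javascript";
  metadata: {
    description?: string;
    wordCount: number;
    loadTime: number; // milliseconds
    screenshot?: string; // base64, present only if requested
  };
}

// Render the header block followed by the page content.
function toMarkdown(r: ScrapeResult): string {
  const lines = [
    `# ${r.title}`,
    "",
    `**URL:** ${r.url}`,
    `**Render Method:** ${r.renderMethod}`,
    "",
  ];
  if (r.metadata.description) lines.push(`**Description:** ${r.metadata.description}`);
  lines.push(`**Word Count:** ${r.metadata.wordCount} | **Load Time:** ${r.metadata.loadTime}ms`);
  lines.push("", "---", "", r.content);
  return lines.join("\n");
}
```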

Performance Comparison

| Feature | Static Mode | JavaScript Mode |
|---|---|---|
| Speed | ~1-3s | ~3-8s |
| JavaScript | ❌ | ✅ |
| SPA Support | ❌ | ✅ |
| Lazy Loading | ❌ | ✅ |
| Resource Usage | Low | Medium |
| Best For | Traditional HTML | Modern Web Apps |

Use Cases

1. Scraping JavaScript-Heavy Sites

// Site with React/Vue/Angular
{
  "url": "https://spa-site.com",
  "use_javascript": true,
  "wait_for_selector": "#root > div",
  "wait_time": 2000
}

2. Capturing Visual State

// Get screenshot along with content
{
  "url": "https://example.com/dashboard",
  "use_javascript": true,
  "take_screenshot": true
}

3. API Documentation Sites

// For example, the UAZ API docs
{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".api-content",
  "response_format": "json"
}

4. E-commerce Product Pages

// Lazy-loaded images and dynamic prices
{
  "url": "https://shop.example.com/product/123",
  "use_javascript": true,
  "wait_time": 3000
}

Troubleshooting

Playwright Issues

# Reinstall browsers
npm run install:browsers

# Check Playwright installation
npx playwright --version

Low Word Count on Dynamic Sites

Problem: Getting < 50 words from a JavaScript site?

Solution:

  • Set use_javascript: true explicitly
  • Use wait_for_selector for specific elements
  • Increase wait_time if content loads slowly

Memory Issues

Problem: Browser consuming too much memory?

Solution:

  • The browser instance is reused and shared
  • Contexts are closed after each operation
  • Consider increasing system resources for heavy usage
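
The reuse pattern described above (one shared browser, one short-lived context per operation) can be sketched as follows. `BrowserLike` and `ContextLike` are hypothetical minimal interfaces modeled on Playwright's `browser.newContext()` and `context.close()`; `withContext` is an illustrative helper, not the server's actual code.

```typescript
// Minimal structural types modeled on Playwright's Browser and BrowserContext.
interface ContextLike {
  close(): Promise<void>;
}
interface BrowserLike<C extends ContextLike> {
  newContext(): Promise<C>;
}

// Run `task` in a fresh context and always close it, even if the task throws,
// so memory held by the context is released after each operation.
async function withContext<C extends ContextLike, T>(
  browser: BrowserLike<C>,
  task: (context: C) => Promise<T>,
): Promise<T> {
  const context = await browser.newContext();
  try {
    return await task(context);
  } finally {
    await context.close();
  }
}
```

With Playwright, the browser would be launched once (for example via `chromium.launch()`) and passed to `withContext` for every scrape, so only contexts are created and destroyed per request.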

Advantages over Puppeteer

✅ Better Performance: Playwright is generally faster
✅ More Reliable: Better handling of modern web apps
✅ Auto-waiting: Smarter element waiting
✅ Multiple Browsers: Can use Chromium, Firefox, or WebKit
✅ Modern APIs: Cleaner, more intuitive API
✅ Active Development: Microsoft-backed, frequent updates

Development

Project Structure

webscraper-mcp-server-v2/
├── src/
│   ├── index.ts           # Main entry point
│   ├── types.ts           # TypeScript definitions (enhanced)
│   ├── constants.ts       # Configuration constants
│   ├── schemas/           # Zod validation (updated)
│   ├── services/          # Playwright-based scraping
│   └── tools/             # MCP tool implementations
├── dist/                  # Compiled JavaScript
├── package.json           # Dependencies (with Playwright)
└── README.md

Building

npm run build

Testing

# With MCP Inspector
npx @modelcontextprotocol/inspector node dist/index.js

Limitations

  • Maximum 10 URLs for batch scraping
  • Content truncated at 100,000 characters
  • Request timeout: 1-120 seconds (the examples pass timeout in milliseconds, e.g. 60000)
  • Chromium browser required (~170MB download)
  • Supports only HTTP/HTTPS protocols
  • Requires publicly accessible URLs
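
Some of these limits can be checked or applied on the client side before calling the server. The sketch below is illustrative only: the helper names are hypothetical, the millisecond unit for the timeout is an assumption based on the examples, and public reachability cannot be verified by a static check like this.

```typescript
const MAX_CONTENT_CHARS = 100000; // truncation limit stated above
const MIN_TIMEOUT_MS = 1000; // 1 second, assuming the parameter is in milliseconds
const MAX_TIMEOUT_MS = 120000; // 120 seconds

// Accept only http: and https: URLs; anything unparsable is rejected.
function isSupportedUrl(raw: string): boolean {
  try {
    const { protocol } = new URL(raw);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Cut content at the documented limit and report whether anything was dropped.
function truncateContent(content: string): { text: string; truncated: boolean } {
  if (content.length <= MAX_CONTENT_CHARS) return { text: content, truncated: false };
  return { text: content.slice(0, MAX_CONTENT_CHARS), truncated: true };
}

// Keep a requested timeout inside the documented 1-120 second range.
function clampTimeoutMs(ms: number): number {
  return Math.min(Math.max(ms, MIN_TIMEOUT_MS), MAX_TIMEOUT_MS);
}
```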

Performance Tips

  1. Use Static Mode When Possible: 3-5x faster for traditional sites
  2. Batch Related URLs: More efficient than individual calls
  3. Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
  4. Use Selectors Wisely: Wait for specific elements instead of fixed times
  5. Limit Screenshot Usage: Screenshots increase response size significantly

Comparison with v1.0

| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
|---|---|---|
| Static HTML | ✅ Fast | ✅ Fast |
| JavaScript | ❌ | ✅ Full Support |
| Auto-Detection | ❌ | ✅ Smart Fallback |
| Screenshots | ❌ | ✅ Base64 Output |
| Lazy Loading | ❌ | ✅ Supported |
| SPAs | ❌ Limited | ✅ Full Support |

License

MIT

Contributing

Contributions welcome! Areas for improvement:

  • [ ] Support for other Playwright browsers (Firefox, WebKit)
  • [ ] PDF generation from pages
  • [ ] Advanced selector strategies
  • [ ] Request interception for blocking ads
  • [ ] Cookie management
  • [ ] Proxy support

Support

For issues or questions, please open an issue on the GitHub repository.


Made with ❤️ using Microsoft Playwright and Model Context Protocol
