WebScraper MCP Server v2.0 (Playwright Edition)

A Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.

🆕 What's New in v2.0

🚀 Microsoft Playwright - Superior JavaScript rendering with automatic fallback
🎯 Smart Detection - Automatically switches to JS rendering when needed
📸 Screenshots - Capture page screenshots as base64
⏱️ Custom Waits - Wait for specific selectors or time periods
🔄 Dual Mode - Static scraping for speed, JS rendering for dynamic content
📊 Performance Metrics - Track load times and render methods

Features

Core Capabilities

🌐 Intelligent Web Scraping: Automatic detection of static vs dynamic pages
📝 HTML to Markdown: Clean, well-formatted Markdown conversion
🎭 JavaScript Rendering: Full Playwright support for SPA and dynamic content
🔗 Link Extraction: Extract all hyperlinks with filtering options
🖼️ Image Extraction: Extract images including lazy-loaded ones
📦 Batch Processing: Scrape up to 10 URLs simultaneously
🎯 Metadata Extraction: Title, description, author, keywords, and more
⚙️ Flexible Options: Control timeouts, redirects, content inclusion
📊 Multiple Formats: Output in Markdown or JSON
📸 Screenshot Capture: Get base64 screenshots of pages

Rendering Modes

Static Mode (Default, Fast)
- Uses Axios + Cheerio
- Suitable for traditional HTML pages
- Fastest performance

JavaScript Mode (Auto-detected or Forced)
- Uses Playwright with Chromium
- Executes JavaScript
- Handles SPAs, lazy loading, dynamic content
- Auto-activates when static mode returns < 50 words
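
The auto-detection rule above can be sketched in a few lines. This is a minimal illustration, not the server's actual code: the names `countWords`, `chooseRenderMethod`, and `MIN_STATIC_WORDS` are hypothetical, and only the 50-word threshold comes from this README.

```typescript
// Hypothetical sketch; only the 50-word threshold is taken from the README.
const MIN_STATIC_WORDS = 50;

type RenderMethod = "static" | "javascript";

/** Count whitespace-separated words in the text extracted from a page. */
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

/**
 * Pick a render method.
 * - staticText: the text that static mode managed to extract.
 * - forceJavaScript: mirrors the `use_javascript` tool parameter.
 */
function chooseRenderMethod(staticText: string, forceJavaScript = false): RenderMethod {
  if (forceJavaScript) return "javascript";
  return countWords(staticText) < MIN_STATIC_WORDS ? "javascript" : "static";
}
```

Under this sketch, a page whose static HTML contains only a short loading placeholder would fall through to JavaScript mode, while a traditional article would stay in static mode.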

Installation

# Clone or navigate to the project
cd webscraper-mcp-server-v2

# Install dependencies
npm install

# Install Playwright browsers
npm run install:browsers

# Build the project
npm run build

Usage

Running with stdio (Local)

npm start

Running with HTTP (Remote)

TRANSPORT=http PORT=3000 npm start

Available Tools

1. `webscraper_scrape_page` - Advanced Web Scraping

Automatically detects and handles both static and dynamic pages.

New Parameters:

use_javascript (boolean): Force JavaScript rendering
wait_for_selector (string): CSS selector to wait for
wait_time (number): Additional wait time in milliseconds
take_screenshot (boolean): Capture page screenshot

Example - Force JavaScript Rendering:

{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".content",
  "wait_time": 3000,
  "take_screenshot": true
}

Example - Auto-Detection:

{
  "url": "https://example.com/spa-app"
}

If static mode returns fewer than 50 words, the server automatically switches to JavaScript rendering.
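
For callers that build requests in TypeScript, the parameters above could be typed and sanity-checked before the tool is invoked. The `ScrapePageArgs` interface and `checkScrapePageArgs` helper are illustrative assumptions; the server's own validation lives in its Zod schemas and may differ.

```typescript
// Hypothetical client-side types; field names come from the parameter list above.
interface ScrapePageArgs {
  url: string;
  use_javascript?: boolean;   // force JavaScript rendering
  wait_for_selector?: string; // CSS selector to wait for
  wait_time?: number;         // additional wait, in milliseconds
  take_screenshot?: boolean;  // capture a base64 screenshot
}

/** Return a list of problems; an empty list means the arguments look usable. */
function checkScrapePageArgs(args: ScrapePageArgs): string[] {
  const problems: string[] = [];
  // The README states that only HTTP/HTTPS URLs are supported.
  if (!/^https?:\/\//i.test(args.url)) {
    problems.push("url must start with http:// or https://");
  }
  if (args.wait_time !== undefined && (!isFinite(args.wait_time) || args.wait_time < 0)) {
    problems.push("wait_time must be a non-negative number of milliseconds");
  }
  if (args.wait_for_selector !== undefined && args.wait_for_selector.trim() === "") {
    problems.push("wait_for_selector must not be empty");
  }
  return problems;
}
```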

2. `webscraper_extract_links` - Link Extraction

New Parameter:

use_javascript (boolean): Use JavaScript rendering for dynamic links

Example:

{
  "url": "https://example.com",
  "use_javascript": true,
  "filter_external": true
}
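
The exact semantics of `filter_external` are not spelled out above. Assuming it means "drop links that point to a different host than the page", the filtering step might look like the sketch below. The `ExtractedLink` shape and the `dropExternalLinks` name are hypothetical; `URL` is the standard WHATWG URL class available in Node.js and browsers.

```typescript
// Hypothetical link shape; the server's real output may carry more fields.
interface ExtractedLink {
  href: string;
  text: string;
}

/**
 * Resolve each href against the page URL and keep only links on the same host.
 * Assumption: "external" means "hostname differs from the page's hostname".
 */
function dropExternalLinks(pageUrl: string, links: ExtractedLink[]): ExtractedLink[] {
  const pageHost = new URL(pageUrl).hostname;
  const kept: ExtractedLink[] = [];
  for (const link of links) {
    try {
      const resolved = new URL(link.href, pageUrl);
      if (resolved.hostname === pageHost) {
        kept.push({ href: resolved.href, text: link.text });
      }
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return kept;
}
```

Resolving against the page URL first means relative links such as `/about` are treated as internal rather than discarded.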

3. `webscraper_extract_images` - Image Extraction

New Parameter:

use_javascript (boolean): Extract lazy-loaded images

Example:

{
  "url": "https://example.com/gallery",
  "use_javascript": true,
  "limit": 50
}

4. `webscraper_batch_scrape` - Batch Operations

New Parameter:

use_javascript (boolean): Use JavaScript for all URLs

Example:

{
  "urls": ["https://page1.com", "https://page2.com"],
  "use_javascript": true,
  "timeout": 60000
}

Configuration

Environment Variables

TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio')
PORT: HTTP server port (default: 3000, only for HTTP transport)
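
As an illustration of how these two variables and their defaults fit together, the following sketch resolves them into a small config object. The function name `resolveConfig` and the `ServerConfig` type are assumptions made for this example, not the server's actual code; how the real server reacts to invalid values is not documented here.

```typescript
type Transport = "stdio" | "http";

interface ServerConfig {
  transport: Transport;
  port?: number; // only set for the HTTP transport
}

/** Resolve TRANSPORT and PORT using the defaults documented above. */
function resolveConfig(env: Record<string, string | undefined>): ServerConfig {
  const raw = env["TRANSPORT"] ?? "stdio";
  if (raw !== "stdio" && raw !== "http") {
    // Assumption: unknown transports are rejected rather than ignored.
    throw new Error(`Unsupported TRANSPORT: ${raw}`);
  }
  if (raw === "stdio") {
    return { transport: "stdio" };
  }
  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }
  return { transport: "http", port };
}
```

In Node.js this would typically be called as `resolveConfig(process.env)`.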

Client Configuration (Claude Desktop)

{
  "mcpServers": {
    "webscraper": {
      "command": "node",
      "args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"]
    }
  }
}

Output Formats

Markdown Format (Enhanced)

# Page Title

**URL:** https://example.com
**Render Method:** javascript

**Description:** Page description
**Author:** Author Name
**Word Count:** 1500 | **Status:** 200 | **Load Time:** 2340ms

---

[Page content in Markdown...]

JSON Format (Enhanced)

{
  "url": "https://example.com",
  "title": "Page Title",
  "content": "Markdown content...",
  "renderMethod": "javascript",
  "metadata": {
    "description": "Page description",
    "wordCount": 1500,
    "loadTime": 2340,
    "screenshot": "base64..." // if requested
  }
}
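
The two formats carry the same information, so the Markdown header can be derived from the JSON object. The sketch below shows one way to do that. The `ScrapeResult` type is inferred from the example above; the `author` and `status` metadata fields are assumptions, since they appear in the Markdown example but not in the JSON one.

```typescript
interface ScrapeResult {
  url: string;
  title: string;
  content: string;
  renderMethod: "static" | "javascript";
  metadata: {
    description?: string;
    author?: string;     // assumed field name
    status?: number;     // assumed field name
    wordCount: number;
    loadTime: number;    // milliseconds
    screenshot?: string; // base64, present only if requested
  };
}

/** Render a result in the Markdown layout shown in this README. */
function renderMarkdown(result: ScrapeResult): string {
  const m = result.metadata;
  const lines: string[] = [
    `# ${result.title}`,
    "",
    `**URL:** ${result.url}`,
    `**Render Method:** ${result.renderMethod}`,
    "",
  ];
  if (m.description !== undefined) lines.push(`**Description:** ${m.description}`);
  if (m.author !== undefined) lines.push(`**Author:** ${m.author}`);
  const stats = [`**Word Count:** ${m.wordCount}`];
  if (m.status !== undefined) stats.push(`**Status:** ${m.status}`);
  stats.push(`**Load Time:** ${m.loadTime}ms`);
  lines.push(stats.join(" | "), "", "---", "", result.content);
  return lines.join("\n");
}
```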

Performance Comparison

| Feature | Static Mode | JavaScript Mode |
| --- | --- | --- |
| Speed | ~1-3s | ~3-8s |
| JavaScript | ❌ | ✅ |
| SPA Support | ❌ | ✅ |
| Lazy Loading | ❌ | ✅ |
| Resource Usage | Low | Medium |
| Best For | Traditional HTML | Modern Web Apps |

Use Cases

1. Scraping JavaScript-Heavy Sites

// Site with React/Vue/Angular
{
  "url": "https://spa-site.com",
  "use_javascript": true,
  "wait_for_selector": "#root > div",
  "wait_time": 2000
}

2. Capturing Visual State

// Get screenshot along with content
{
  "url": "https://example.com/dashboard",
  "use_javascript": true,
  "take_screenshot": true
}

3. API Documentation Sites

// Example: the UAZ API docs
{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".api-content",
  "response_format": "json"
}

4. E-commerce Product Pages

// Lazy-loaded images and dynamic prices
{
  "url": "https://shop.example.com/product/123",
  "use_javascript": true,
  "wait_time": 3000
}

Troubleshooting

Playwright Issues

# Reinstall browsers
npm run install:browsers

# Check Playwright installation
npx playwright --version

Low Word Count on Dynamic Sites

Problem: Getting < 50 words from a JavaScript site?

Solution:

Set use_javascript: true explicitly
Use wait_for_selector for specific elements
Increase wait_time if content loads slowly

Memory Issues

Problem: Browser consuming too much memory?

Solution:

No cleanup is needed on your side: the browser instance is reused and shared, and contexts are closed after each operation
If memory use is still too high under heavy load, consider increasing system resources

Advantages over Puppeteer

✅ Better Performance: Playwright is generally faster
✅ More Reliable: Better handling of modern web apps
✅ Auto-waiting: Smarter element waiting
✅ Multiple Browsers: Playwright can drive Chromium, Firefox, or WebKit (this server currently uses Chromium only)
✅ Modern APIs: Cleaner, more intuitive API
✅ Active Development: Microsoft-backed, frequent updates

Development

Project Structure

webscraper-mcp-server-v2/
├── src/
│   ├── index.ts           # Main entry point
│   ├── types.ts           # TypeScript definitions (enhanced)
│   ├── constants.ts       # Configuration constants
│   ├── schemas/           # Zod validation (updated)
│   ├── services/          # Playwright-based scraping
│   └── tools/             # MCP tool implementations
├── dist/                  # Compiled JavaScript
├── package.json           # Dependencies (with Playwright)
└── README.md

Building

npm run build

Testing

# With MCP Inspector
npx @modelcontextprotocol/inspector node dist/index.js

Limitations

Maximum 10 URLs for batch scraping
Content truncated at 100,000 characters
Request timeout: 1-120 seconds
Chromium browser required (~170MB download)
Supports only HTTP/HTTPS protocols
Requires publicly accessible URLs
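
To make the numeric limits concrete, here is a minimal sketch of how they could be enforced. The constants come from the list above; the function names are hypothetical, and whether the real server rejects, clamps, or silently truncates out-of-range input is an assumption made for illustration. Timeouts are assumed to be passed in milliseconds, as in the batch example.

```typescript
const MAX_BATCH_URLS = 10;
const MAX_CONTENT_CHARS = 100000;
const MIN_TIMEOUT_MS = 1000;   // 1 second
const MAX_TIMEOUT_MS = 120000; // 120 seconds

/** Throw if a batch is empty, too large, or contains a non-HTTP(S) URL. */
function assertBatchAllowed(urls: string[]): void {
  if (urls.length === 0 || urls.length > MAX_BATCH_URLS) {
    throw new Error(`Expected 1-${MAX_BATCH_URLS} URLs, got ${urls.length}`);
  }
  for (const url of urls) {
    if (!/^https?:\/\//i.test(url)) {
      throw new Error(`Only HTTP/HTTPS URLs are supported: ${url}`);
    }
  }
}

/** Cut content at the documented character limit and report whether it was cut. */
function truncateContent(content: string): { text: string; truncated: boolean } {
  if (content.length <= MAX_CONTENT_CHARS) {
    return { text: content, truncated: false };
  }
  return { text: content.slice(0, MAX_CONTENT_CHARS), truncated: true };
}

/** Assumption: out-of-range timeouts are clamped into the 1-120 second window. */
function clampTimeout(timeoutMs: number): number {
  return Math.min(MAX_TIMEOUT_MS, Math.max(MIN_TIMEOUT_MS, timeoutMs));
}
```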

Performance Tips

Use Static Mode When Possible: 3-5x faster for traditional sites
Batch Related URLs: More efficient than individual calls
Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
Use Selectors Wisely: Wait for specific elements instead of fixed times
Limit Screenshot Usage: Screenshots increase response size significantly
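
The screenshot tip follows from how base64 works: every 3 bytes of image data become 4 characters of text, so the encoded payload is roughly a third larger than the image itself. The helpers below are a small illustrative sketch (the function names are hypothetical) for estimating that overhead.

```typescript
/** Length in characters of the padded base64 encoding of `byteCount` bytes. */
function base64EncodedLength(byteCount: number): number {
  return Math.ceil(byteCount / 3) * 4;
}

/** Number of bytes represented by a padded base64 string. */
function base64DecodedBytes(b64: string): number {
  const len = b64.length;
  if (len === 0) return 0;
  if (len % 4 !== 0) {
    throw new Error("Expected padded base64 (length must be a multiple of 4)");
  }
  // Each trailing "=" stands for one missing byte in the final 3-byte group.
  let padding = 0;
  if (b64.charAt(len - 1) === "=") padding++;
  if (b64.charAt(len - 2) === "=") padding++;
  return (len / 4) * 3 - padding;
}
```

For example, a 300,000-byte PNG would add 400,000 characters to the response under this encoding.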

Comparison with v1.0

| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
| --- | --- | --- |
| Static HTML | ✅ Fast | ✅ Fast |
| JavaScript | ❌ | ✅ Full Support |
| Auto-Detection | ❌ | ✅ Smart Fallback |
| Screenshots | ❌ | ✅ Base64 Output |
| Lazy Loading | ❌ | ✅ Supported |
| SPAs | ❌ Limited | ✅ Full Support |

License

MIT

Contributing

Contributions welcome! Areas for improvement:

[ ] Support for other Playwright browsers (Firefox, WebKit)
[ ] PDF generation from pages
[ ] Advanced selector strategies
[ ] Request interception for blocking ads
[ ] Cookie management
[ ] Proxy support

Support

For issues or questions, please open an issue on the GitHub repository.

Made with ❤️ using Microsoft Playwright and Model Context Protocol
