crawl-mcp-server

A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.


✨ Features

  • 🔍 SearXNG Web Search - Search the web with automatic browser management
  • 📄 4 Crawling Tools - Extract and convert web content to Markdown
  • 🚀 Auto-Browser-Launch - Search tools automatically manage the browser lifecycle
  • 📦 11 Total Tools - Complete toolkit for web interaction
  • 💾 Built-in Caching - SHA-256-based caching with graceful fallbacks (see the sketch after this list)
  • ⚡ Concurrent Processing - Handle multiple URLs simultaneously (up to 50 per batch)
  • 🎯 LLM-Optimized Output - Clean Markdown suited to AI consumption
  • 🛡️ Robust Error Handling - Graceful failure with detailed error messages
  • 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks
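
The caching layer above is keyed by SHA-256. As a rough illustration (an assumption about the approach, not this server's actual code), a cache key can be derived from the URL plus any options that affect the rendered output:

import { createHash } from "node:crypto";

// Hypothetical cache-key derivation: hash the URL together with the
// options that change the output, so different option sets never
// collide in the cache.
function cacheKey(url: string, options: Record<string, unknown> = {}): string {
  return createHash("sha256")
    .update(url)
    .update(JSON.stringify(options))
    .digest("hex");
}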

📦 Installation

Method 1: npm (Recommended)

npm install crawl-mcp-server

Method 2: Direct from Git

# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main

# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git

Method 3: Clone and Build

git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build

Method 4: npx (No Installation)

# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

🔧 Setup for Claude Code

Option 1: MCP Desktop (Recommended)

Add to your Claude Desktop configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Option 2: Local Installation

If you've installed locally:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/path/to/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

Option 3: Custom Path

For a specific installation:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

After configuration, restart Claude Desktop.

🔧 Setup for Other MCP Clients

Claude CLI

# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js

Zed Editor

Add to ~/.config/zed/settings.json:

{
  "assistant": {
    "mcp": {
      "servers": {
        "crawl-server": {
          "command": "node",
          "args": ["/path/to/crawl-mcp-server/dist/index.js"]
        }
      }
    }
  }
}

VSCode with Copilot Chat

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/path/to/crawl-mcp-server/dist/index.js"]
    }
  }
}

🚀 Quick Start

Using MCP Inspector (Testing)

# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector

# Run the server
node dist/index.js

# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

Development Mode

# Watch mode (auto-rebuild on changes)
npm run dev

# Build TypeScript
npm run build

# Run tests
npm run test:run

📚 Available Tools

Search Tools (7 tools)

1. search_searx

Search the web using SearXNG with automatic browser management.

// Example call
{
  "query": "TypeScript MCP server",
  "maxResults": 10,
  "category": "general",
  "timeRange": "week",
  "language": "en"
}

Parameters:

  • query (string, required): Search query
  • maxResults (number, default: 20): Results to return (1-50)
  • category (enum, default: general): one of general, images, videos, news, map, music, it, science
  • timeRange (enum, optional): one of day, week, month, year
  • language (string, default: en): Language code

Returns: JSON with search results array, URLs, and metadata


2. launch_chrome_cdp

Launch system Chrome with remote debugging for advanced SearXNG usage.

{
  "headless": true,
  "port": 9222,
  "userDataDir": "/path/to/profile"
}

Parameters:

  • headless (boolean, default: true): Run Chrome headless
  • port (number, default: 9222): Remote debugging port
  • userDataDir (string, optional): Custom Chrome profile

3. connect_cdp

Connect to remote CDP browser (Browserbase, etc.).

{
  "cdpWsUrl": "http://localhost:9222"
}

Parameters:

  • cdpWsUrl (string, required): CDP WebSocket URL or HTTP endpoint

4. launch_local

Launch bundled Chromium for SearXNG search.

{
  "headless": true,
  "userAgent": "custom user agent string"
}

Parameters:

  • headless (boolean, default: true): Run headless
  • userAgent (string, optional): Custom user agent

5. chrome_status

Check Chrome CDP status and health.

{}

Returns: Running status, health, endpoint URL, and PID


6. close

Close browser session (keeps Chrome CDP running).

{}

7. shutdown_chrome_cdp

Shutdown Chrome CDP and cleanup resources.

{}

Crawling Tools (4 tools)

1. crawl_read ⭐ (Simple & Fast)

Quick single-page extraction to Markdown.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ News articles
  • ✅ Blog posts
  • ✅ Documentation pages
  • ✅ Simple content extraction

Returns: Clean Markdown content


2. crawl_read_batch ⭐ (Multiple URLs)

Process 1-50 URLs concurrently.

{
  "urls": [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
  ],
  "options": {
    "maxConcurrency": 5,
    "timeout": 30000,
    "maxResults": 10
  }
}

Best for:

  • ✅ Processing multiple articles
  • ✅ Building content aggregates
  • ✅ Bulk content extraction

Returns: Array of Markdown results with summary statistics


3. crawl_fetch_markdown

Controlled single-page extraction with full option control.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ Advanced crawling options
  • ✅ Custom timeout control
  • ✅ Detailed extraction

4. crawl_fetch

Multi-page crawling with intelligent link extraction.

{
  "url": "https://example.com",
  "options": {
    "pages": 5,
    "maxConcurrency": 3,
    "sameOriginOnly": true,
    "timeout": 30000,
    "maxResults": 20
  }
}

Best for:

  • ✅ Crawling entire sites
  • ✅ Link-based discovery
  • ✅ Multi-page scraping

Features:

  • Extracts links from starting page
  • Crawls discovered pages
  • Concurrent processing
  • Same-origin filtering (configurable; see the sketch below)
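
A minimal sketch of the same-origin check (assumed logic; the server's actual implementation may differ):

// Resolve relative links against the start page, then compare origins
// (scheme + host + port). Node 18+ provides URL globally.
function sameOrigin(startUrl: string, candidate: string): boolean {
  return new URL(candidate, startUrl).origin === new URL(startUrl).origin;
}

// sameOrigin("https://example.com/a", "/b")                  → true
// sameOrigin("https://example.com/a", "https://other.com/b") → false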

💡 Usage Examples

Example 1: Search + Crawl Workflow

// Step 1: Search for topics
{
  "tool": "search_searx",
  "arguments": {
    "query": "TypeScript best practices 2024",
    "maxResults": 5
  }
}

// Step 2: Extract URLs from results
// (Parse the search results to get URLs)

// Step 3: Crawl selected articles
{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }
}
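
Step 2 above is plain client-side glue. Assuming the search results come back as JSON with a results array of { title, url } objects (the exact payload shape is an assumption here; check what search_searx actually returns), extracting the URLs might look like:

// Hypothetical result shape for illustration only.
interface SearxResult {
  title: string;
  url: string;
  snippet?: string;
}

// Keep the top N result URLs for the follow-up batch crawl.
function extractUrls(payload: { results: SearxResult[] }, limit = 5): string[] {
  return payload.results.slice(0, limit).map((result) => result.url);
}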

Example 2: Batch Content Extraction

{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://news.site/article1",
      "https://news.site/article2",
      "https://news.site/article3"
    ],
    "options": {
      "maxConcurrency": 10,
      "timeout": 30000,
      "maxResults": 3
    }
  }
}

Example 3: Site Crawling

{
  "tool": "crawl_fetch",
  "arguments": {
    "url": "https://docs.example.com",
    "options": {
      "pages": 10,
      "maxConcurrency": 5,
      "sameOriginOnly": true,
      "timeout": 30000,
      "maxResults": 10
    }
  }
}

🎯 Tool Selection Guide

Use Case             Recommended Tool            Complexity
Single article       crawl_read                  Simple
Multiple articles    crawl_read_batch            Simple
Advanced options     crawl_fetch_markdown        Medium
Site crawling        crawl_fetch                 Complex
Web search           search_searx                Simple
Research workflow    search_searx → crawl_read   Medium

๐Ÿ—๏ธ Architecture

Core Components

┌──────────────────────────────────────────┐
│             crawl-mcp-server             │
├──────────────────────────────────────────┤
│                                          │
│  ┌─────────────────────────────┐         │
│  │     MCP Server Core         │         │
│  │  - 11 registered tools      │         │
│  │  - STDIO/HTTP transport     │         │
│  └─────────────────────────────┘         │
│                │                         │
│  ┌─────────────────────────────┐         │
│  │   @just-every/crawl         │         │
│  │  - HTML → Markdown          │         │
│  │  - Mozilla Readability      │         │
│  │  - Concurrent crawling      │         │
│  └─────────────────────────────┘         │
│                │                         │
│  ┌─────────────────────────────┐         │
│  │   Playwright (Browser)      │         │
│  │  - SearXNG integration      │         │
│  │  - Auto browser management  │         │
│  │  - Anti-detection           │         │
│  └─────────────────────────────┘         │
│                                          │
└──────────────────────────────────────────┘

Technology Stack

  • Runtime: Node.js 18+
  • Language: TypeScript 5.7
  • Framework: MCP SDK (@modelcontextprotocol/sdk)
  • Crawling: @just-every/crawl
  • Browser: Playwright Core
  • Validation: Zod
  • Transport: STDIO (local) + HTTP (remote)

Data Flow

Client Request
    ↓
MCP Protocol
    ↓
Tool Handler
    ↓
┌────────────────────┐
│   Crawl/Search     │
│  @just-every/crawl │  →  HTML content
│   or SearXNG       │  →  Search results
└────────────────────┘
    ↓
HTML → Markdown
    ↓
Result Formatting
    ↓
MCP Response
    ↓
Client
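
To make the flow concrete, here is a minimal sketch of a tool handler registered with the MCP SDK. It is illustrative only: fetchAsMarkdown is a stand-in for the real @just-every/crawl conversion, and the registration mirrors the SDK's server.tool() pattern rather than this project's actual source.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "crawl-mcp-server", version: "0.0.0" });

// Stand-in for the real crawler: fetch the page and return its body.
// The actual server converts HTML to Markdown via @just-every/crawl.
async function fetchAsMarkdown(url: string): Promise<string> {
  const response = await fetch(url);
  return await response.text();
}

server.tool(
  "crawl_read",
  { url: z.string().url() }, // Zod validates the incoming argument
  async ({ url }) => ({
    content: [{ type: "text" as const, text: await fetchAsMarkdown(url) }],
  })
);

// STDIO transport: the default for local MCP clients.
await server.connect(new StdioServerTransport());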

🧪 Testing

Run Test Suite

# All unit tests
npm run test:run

# Performance benchmarks
npm run test:performance

# Full CI suite
npm run test:ci

# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call \
  --tool-name crawl_read \
  --tool-arg url="https://example.com"

Test Coverage

  • ✅ All 11 tools tested
  • ✅ Error handling validated
  • ✅ Performance benchmarks
  • ✅ Integration workflows
  • ✅ Multi-Node support (Node 18, 20, 22)

CI/CD Pipeline

┌────────────────────────────────────┐
│          GitHub Actions            │
├────────────────────────────────────┤
│  1. Test (Matrix: Node 18,20,22)   │
│  2. Integration Tests (PR only)    │
│  3. Performance Tests (main)       │
│  4. Security Scan                  │
│  5. Coverage Report                │
└────────────────────────────────────┘

🔧 Development

Prerequisites

  • Node.js 18 or higher
  • npm or yarn

Setup

# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run in development mode (watch)
npm run dev

Development Commands

# Build project
npm run build

# Watch mode (auto-rebuild)
npm run dev

# Run tests
npm run test:run

# Lint code
npm run lint

# Type check
npm run typecheck

# Clean build artifacts
npm run clean

Project Structure

crawl-mcp-server/
├── src/
│   ├── index.ts           # Main server (11 tools)
│   ├── types.ts           # TypeScript interfaces
│   └── cdp.ts             # Chrome CDP manager
├── test/
│   ├── run-tests.ts       # Unit test suite
│   ├── performance.ts     # Performance tests
│   └── config.ts          # Test configuration
├── dist/                  # Compiled JavaScript
├── .github/workflows/     # CI/CD pipeline
└── package.json

📊 Performance

Benchmarks

Operation                   Avg Duration   Max Memory
crawl_read                  ~1500ms        32MB
crawl_read_batch (2 URLs)   ~2500ms        64MB
search_searx                ~4000ms        128MB
crawl_fetch                 ~2000ms        48MB
tools/list                  ~100ms         8MB

Performance Features

  • ✅ Concurrent request processing (up to 20)
  • ✅ Built-in caching (SHA-256)
  • ✅ Automatic timeout management
  • ✅ Memory optimization
  • ✅ Resource cleanup

๐Ÿ›ก๏ธ Error Handling

All tools include comprehensive error handling:

  • Network errors: Graceful degradation with error messages
  • Timeout handling: Configurable timeouts
  • Partial failures: Batch operations continue on individual failures
  • Structured errors: Clear error codes and messages
  • Recovery: Automatic retries where appropriate

Example error response:

{
  "content": [
    {
      "type": "text",
      "text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
    }
  ],
  "structuredContent": {
    "error": "Network timeout",
    "url": "https://example.com",
    "code": "TIMEOUT"
  }
}
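
On the client side, the structuredContent block makes failures machine-checkable. A small sketch, with field names taken from the example above:

interface ToolResult {
  content: { type: string; text: string }[];
  structuredContent?: { error?: string; url?: string; code?: string };
}

// Retry only on timeouts; surface everything else to the caller.
function shouldRetry(result: ToolResult): boolean {
  return result.structuredContent?.code === "TIMEOUT";
}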

๐Ÿ” Security

  • No API keys required for basic crawling
  • Respect robots.txt (configurable)
  • User agent rotation
  • Rate limiting (built-in via concurrency limits)
  • Input validation (Zod schemas; see the sketch below)
  • Dependency scanning (npm audit, Snyk)
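
For illustration, input validation with Zod might look like the following (a sketch with assumed field names and bounds, not the server's actual schemas):

import { z } from "zod";

// Assumed schema for a crawl_read-style call: a well-formed URL plus
// bounded numeric options, rejected before any network work happens.
const crawlInputSchema = z.object({
  url: z.string().url(),
  options: z
    .object({
      timeout: z.number().int().min(1).max(120_000).optional(),
      maxConcurrency: z.number().int().min(1).max(20).optional(),
    })
    .optional(),
});

// Throws a descriptive ZodError on invalid input.
const parsed = crawlInputSchema.parse({ url: "https://example.com" });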

๐ŸŒ Transport Modes

STDIO (Default)

For local MCP clients:

node dist/index.js

HTTP

For remote access:

TRANSPORT=http PORT=3000 node dist/index.js

Server runs on: http://localhost:3000/mcp

๐Ÿ“ Configuration

Environment Variables

# Transport mode (stdio or http)
TRANSPORT=stdio

# HTTP port (when TRANSPORT=http)
PORT=3000

# Node environment
NODE_ENV=production

Tool Configuration

Each tool accepts an options object:

{
  "timeout": 30000,          // Request timeout (ms)
  "maxConcurrency": 5,       // Concurrent requests (1-20)
  "maxResults": 10,          // Limit results (1-50)
  "respectRobots": false,    // Respect robots.txt
  "sameOriginOnly": true     // Only same-origin URLs
}
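
In TypeScript terms, the shared options object corresponds to an interface along these lines (names mirror the JSON above; treat this as a sketch rather than the project's exported type):

interface CrawlOptions {
  timeout?: number;         // request timeout in milliseconds
  maxConcurrency?: number;  // concurrent requests, 1-20
  maxResults?: number;      // result cap, 1-50
  respectRobots?: boolean;  // honor robots.txt
  sameOriginOnly?: boolean; // restrict crawl_fetch to the start origin
}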

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Run tests: npm run test:ci
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

Development Guidelines

  • Follow TypeScript strict mode
  • Add tests for new features
  • Update documentation
  • Run linting: npm run lint
  • Ensure CI passes

📄 License

MIT License - see LICENSE file

๐Ÿ™ Acknowledgments

๐Ÿ“ž Support

🚀 What's Next?

  • [ ] Add DuckDuckGo search support
  • [ ] Implement content filtering
  • [ ] Add screenshot capabilities
  • [ ] Support for authenticated content
  • [ ] PDF extraction
  • [ ] Real-time monitoring

Built with ❤️ using TypeScript, MCP, and modern web technologies.
