mcp-trafilatura-server

This MCP server enables clean web content extraction from URLs or HTML using Trafilatura, supporting multiple output formats and configurable extraction options.

MCP Trafilatura Server

A Model Context Protocol (MCP) server that provides web content extraction capabilities using Trafilatura's Python API. This server allows MCP clients to extract clean, readable content from web pages and HTML documents.

Features

  • Clean Content Extraction: Uses Trafilatura's advanced algorithms to extract main article content while filtering out navigation, ads, and boilerplate
  • Multiple Output Formats: Supports Markdown, plain text, and XML output
  • Flexible Input: Accepts either URLs (with automatic fetching) or raw HTML content
  • Configurable Extraction: Toggle precision-favoring extraction and the inclusion of tables, images, links, and comments
  • Async Implementation: Non-blocking operations with proper timeout handling
  • Robust Error Handling: Comprehensive error handling with informative error messages
  • Type Safety: Full type hints and Pydantic validation

Installation

Quick Install via pip

pip install mcp-trafilatura-server

Option 1: Using uvx (Recommended - No Installation)

The easiest way to run the server with any MCP client:

# Run directly with uvx (no installation needed)
uvx mcp-trafilatura-server

# Or specify from local directory during development
uvx --from . mcp-trafilatura-server

For use with Claude Desktop, add to your configuration:

{
  "mcpServers": {
    "trafilatura": {
      "command": "uvx",
      "args": ["mcp-trafilatura-server"]
    }
  }
}

Or for local development:

{
  "mcpServers": {
    "trafilatura": {
      "command": "uvx",
      "args": ["--from", "/path/to/mcp-trafilatura", "mcp-trafilatura-server"]
    }
  }
}

Option 2: Using uv for Development

# Clone the repository
git clone https://github.com/achieveai/mcp-web-extractor.git
cd mcp-web-extractor

# Create virtual environment and install
uv venv
uv pip install -e .

# Run the server
uv run mcp-trafilatura-server

Option 3: Traditional pip Installation

# Install from PyPI
pip install mcp-trafilatura-server

# Or for development:
git clone https://github.com/achieveai/mcp-web-extractor.git
cd mcp-web-extractor
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

Option 4: Using requirements.txt

# Install dependencies directly
pip install -r requirements.txt

# Then install the package
pip install -e .

Usage

As an MCP Server

The server can be used with any MCP-compatible client. Register it in your client configuration:

{
  "mcpServers": {
    "web-extractor": {
      "command": "mcp-trafilatura-server",
      "args": []
    }
  }
}

Direct Usage

You can also run the server directly:

# Run the server (listens on stdio)
mcp-trafilatura-server

# Or run the module directly
python -m trafilatura_mcp.server

Tool: extract_markdown

The server provides a single tool called extract_markdown that extracts content from web pages or HTML.

Parameters

Required (one of):

  • url (string): URL to fetch and extract content from (http/https only)
  • html (string): Raw HTML content to extract from

Optional:

  • precision (boolean, default: true): Favor precision over recall in extraction
  • include_comments (boolean, default: false): Include the page's comment section (for example, reader comments under an article) in output
  • include_tables (boolean, default: true): Include tables in extracted content
  • include_images (boolean, default: true): Include images in extracted content
  • include_links (boolean, default: true): Include links in extracted content
  • timeout (integer, default: 30): Request timeout in seconds for URL fetching (5-120)
  • output_format (string, default: "markdown"): Output format - "markdown", "txt", or "xml"
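
The optional parameters line up closely with keyword arguments of Trafilatura's extract() function. The following is a minimal sketch of how a tool call's arguments might be translated into those keyword arguments. The helper name build_extract_kwargs is hypothetical, and the mapping is an assumption about the server's internals rather than its actual code.

```python
from typing import Any

# Defaults as documented in the parameter list above.
TOOL_DEFAULTS: dict[str, Any] = {
    "precision": True,
    "include_comments": False,
    "include_tables": True,
    "include_images": True,
    "include_links": True,
    "output_format": "markdown",
}

ALLOWED_FORMATS = {"markdown", "txt", "xml"}


def build_extract_kwargs(arguments: dict[str, Any]) -> dict[str, Any]:
    """Translate tool arguments into keyword arguments for trafilatura.extract().

    Hypothetical helper: the real server may map its options differently.
    """
    # Start from the documented defaults, then apply any recognized overrides.
    overrides = {k: v for k, v in arguments.items() if k in TOOL_DEFAULTS}
    opts = {**TOOL_DEFAULTS, **overrides}
    if opts["output_format"] not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {opts['output_format']!r}")
    return {
        "favor_precision": opts["precision"],  # Trafilatura's name for this switch
        "include_comments": opts["include_comments"],
        "include_tables": opts["include_tables"],
        "include_images": opts["include_images"],
        "include_links": opts["include_links"],
        "output_format": opts["output_format"],
    }
```

With Trafilatura installed, the result could be passed along as trafilatura.extract(html, **build_extract_kwargs(arguments)). Markdown output depends on a Trafilatura release that accepts output_format="markdown".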

Example Usage

Extracting from a URL:

{
  "name": "extract_markdown",
  "arguments": {
    "url": "https://example.com/article",
    "precision": true,
    "include_tables": true,
    "output_format": "markdown"
  }
}

Extracting from HTML:

{
  "name": "extract_markdown",
  "arguments": {
    "html": "<html><body><h1>Title</h1><p>Content...</p></body></html>",
    "include_comments": false,
    "output_format": "txt"
  }
}

Minimal usage:

{
  "name": "extract_markdown",
  "arguments": {
    "url": "https://news.ycombinator.com/"
  }
}
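
MCP clients normally build and send these payloads for you. For illustration, the sketch below wraps a tool name and its arguments in the JSON-RPC 2.0 tools/call request that MCP uses on the wire. The helper name make_tool_call and the request id are illustrative and not part of this project.

```python
import json
from typing import Any


def make_tool_call(name: str, arguments: dict[str, Any], request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# The minimal usage example, expressed as a wire-level request.
print(make_tool_call("extract_markdown", {"url": "https://news.ycombinator.com/"}))
```

A real session has to complete the MCP initialize handshake before any tools/call request is accepted, so this message alone is not enough to drive the server by hand.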

Configuration for Popular MCP Clients

Claude Desktop

Add to your Claude Desktop configuration file:

  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

Using uvx (recommended - no installation needed):

{
  "mcpServers": {
    "trafilatura": {
      "command": "uvx",
      "args": ["mcp-trafilatura-server"]
    }
  }
}

Or if installed via pip:

{
  "mcpServers": {
    "web-extractor": {
      "command": "mcp-trafilatura-server",
      "args": []
    }
  }
}
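
If the configuration file already lists other servers, the new entry has to be merged in rather than pasted over them. A small sketch of doing that with the standard library follows. The function names add_mcp_server and update_config_file are hypothetical helpers, not something shipped with this package.

```python
import json
from pathlib import Path
from typing import Any


def add_mcp_server(
    config: dict[str, Any], name: str, command: str, args: list[str]
) -> dict[str, Any]:
    """Return a copy of `config` with one entry added under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    return {**config, "mcpServers": servers}


def update_config_file(path: Path) -> None:
    """Read the config at `path` (if present), add the uvx entry, write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_mcp_server(config, "trafilatura", "uvx", ["mcp-trafilatura-server"])
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
```

On macOS, for example, this could be called as update_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"). Back up the file first, and restart Claude Desktop afterwards so it picks up the change.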

VS Code with Continue

Add to your Continue configuration:

{
  "mcpServers": [
    {
      "name": "trafilatura",
      "command": "uvx",
      "args": ["mcp-trafilatura-server"]
    }
  ]
}

Development

Project Structure

mcp-trafilatura/
├── src/
│   └── trafilatura_mcp/
│       ├── __init__.py
│       ├── server.py          # Main server implementation
│       └── py.typed           # Type hints marker
├── pyproject.toml             # Package configuration
├── requirements.txt           # Dependencies
├── LICENSE                    # GPL-3.0 license
├── MANIFEST.in                # Package data files
└── README.md                  # This file

Development Setup

# Clone the repository
git clone https://github.com/achieveai/mcp-web-extractor.git
cd mcp-web-extractor

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run type checking
mypy src/

# Run linting
ruff check src/

# Format code
black src/
isort src/

Testing

# Install test dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=trafilatura_mcp

Error Handling

The server reports errors in the following categories:

  • Input Validation: Invalid URLs, missing required parameters, or invalid parameter values
  • Network Errors: Connection timeouts, HTTP errors, or unreachable URLs
  • Extraction Errors: Empty content, parsing failures, or unsupported content types
  • Server Errors: Internal errors with detailed logging

All errors are returned as proper MCP error responses with descriptive messages.
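
To make the input validation category concrete, here is a minimal standard-library sketch of the checks implied by the documented parameters: one of url or html present, http/https URLs only, and a timeout between 5 and 120 seconds. The function validate_arguments is hypothetical. The server itself relies on Pydantic, and its exact rules and messages may differ.

```python
from typing import Any
from urllib.parse import urlparse


def validate_arguments(arguments: dict[str, Any]) -> None:
    """Raise ValueError with a descriptive message for invalid tool input.

    Illustrative only: not the server's actual validation code.
    """
    url = arguments.get("url")
    html = arguments.get("html")
    if not url and not html:
        raise ValueError("either 'url' or 'html' must be provided")
    if url:
        parsed = urlparse(url)
        # Only absolute http/https URLs are accepted.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"invalid URL (http/https only): {url!r}")
    timeout = arguments.get("timeout", 30)
    if not isinstance(timeout, int) or not 5 <= timeout <= 120:
        raise ValueError("'timeout' must be an integer between 5 and 120 seconds")
```

Failing early with a specific message lets the client see what to correct before any network request is made.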

Logging

The server uses Python's built-in logging with INFO level by default. Logs include:

  • Server startup and shutdown
  • URL fetching attempts
  • Content extraction operations
  • Error conditions with details
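
Because the server communicates over stdio, stdout is reserved for MCP messages, so log output has to go elsewhere, typically stderr. A minimal sketch of such a setup follows. The logger name trafilatura_mcp and the format string are assumptions, not taken from the server's source.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Send log records to stderr so stdout stays free for MCP messages."""
    logger = logging.getLogger("trafilatura_mcp")  # assumed logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging()
logger.info("Server starting")
```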

Dependencies

  • modelcontextprotocol: Model Context Protocol implementation
  • pydantic: Data validation and settings management
  • trafilatura: Core content extraction functionality
  • httpx: Async HTTP client for URL fetching
  • typing-extensions: Additional type hints support

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0) because it directly imports and uses Trafilatura's Python API, which is GPL-3.0 licensed. See the LICENSE file for details.

Important: Since this server uses Trafilatura's Python API (Option B from docs), it constitutes a derivative work under GPL-3.0. If you need different licensing terms, consider:

  • Using Option A (CLI subprocess approach) for more licensing flexibility
  • Implementing your own extraction logic
  • Contacting Trafilatura maintainers for commercial licensing

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass
  6. Submit a pull request

Support

For issues, questions, or contributions, please visit the project repository or open an issue.

Changelog

v0.1.0

  • Initial release
  • Basic content extraction from URLs and HTML
  • Support for multiple output formats
  • Comprehensive error handling and validation
  • Async implementation with timeout support
