mcp-trafilatura-server
This MCP server enables clean web content extraction from URLs or HTML using Trafilatura, supporting multiple output formats and configurable extraction options.
MCP Trafilatura Server
A Model Context Protocol (MCP) server that provides web content extraction capabilities using Trafilatura's Python API. This server allows MCP clients to extract clean, readable content from web pages and HTML documents.
Features
- Clean Content Extraction: Uses Trafilatura to extract the main article content while filtering out navigation, ads, and boilerplate
- Multiple Output Formats: Supports Markdown, plain text, and XML output
- Flexible Input: Accept either URLs (with automatic fetching) or raw HTML content
- Configurable Extraction: Fine-tune extraction with a precision mode and toggles for tables, images, links, and comments
- Async Implementation: Non-blocking operations with proper timeout handling
- Robust Error Handling: Validation, network, and extraction failures are reported with informative error messages
- Type Safety: Full type hints and Pydantic validation
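The async and timeout behavior listed above can be illustrated with a minimal sketch. It assumes the extraction step is a blocking call that is moved to a worker thread and bounded with `asyncio.wait_for`; `blocking_extract` and `extract_with_timeout` are hypothetical names for illustration, not functions from this package.

```python
import asyncio
import time


def blocking_extract(html: str) -> str:
    """Hypothetical stand-in for a blocking extraction call."""
    time.sleep(0.01)  # simulate a small amount of work
    return html.upper()


async def extract_with_timeout(html: str, timeout: float) -> str:
    """Run the blocking call in a worker thread, bounded by a timeout."""
    try:
        # to_thread keeps the event loop free while the call runs.
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_extract, html), timeout
        )
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"extraction exceeded {timeout}s") from exc


if __name__ == "__main__":
    print(asyncio.run(extract_with_timeout("<p>hello</p>", timeout=5)))
```

Note that a timed-out worker thread is not cancelled; it simply finishes in the background, which is a known limitation of this pattern.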
Installation
Quick Install via pip
pip install mcp-trafilatura-server
Option 1: Using uvx (Recommended - No Installation)
The easiest way to run the server with any MCP client:
# Run directly with uvx (no installation needed)
uvx mcp-trafilatura-server
# Or specify from local directory during development
uvx --from . mcp-trafilatura-server
For use with Claude Desktop, add to your configuration:
{
"mcpServers": {
"trafilatura": {
"command": "uvx",
"args": ["mcp-trafilatura-server"]
}
}
}
Or for local development:
{
"mcpServers": {
"trafilatura": {
"command": "uvx",
"args": ["--from", "/path/to/mcp-trafilatura", "mcp-trafilatura-server"]
}
}
}
Option 2: Using uv for Development
# Clone the repository
git clone https://github.com/achieveai/mcp-web-extractor.git mcp-trafilatura
cd mcp-trafilatura
# Create virtual environment and install
uv venv
uv pip install -e .
# Run the server
uv run mcp-trafilatura-server
Option 3: Traditional pip Installation
# Install from PyPI
pip install mcp-trafilatura-server
# Or for development:
git clone https://github.com/achieveai/mcp-web-extractor.git mcp-trafilatura
cd mcp-trafilatura
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
Option 4: Using requirements.txt
# Install dependencies directly
pip install -r requirements.txt
# Then install the package
pip install -e .
Usage
As an MCP Server
The server can be used with any MCP-compatible client. Register it in your client configuration:
{
"mcpServers": {
"web-extractor": {
"command": "mcp-trafilatura-server",
"args": []
}
}
}
Direct Usage
You can also run the server directly:
# Run the server (listens on stdio)
mcp-trafilatura-server
# Or run the module directly
python -m trafilatura_mcp.server
Tool: extract_markdown
The server provides a single tool called extract_markdown that extracts content from web pages or HTML.
Parameters
Required (one of):
- url (string): URL to fetch and extract content from (http/https only)
- html (string): Raw HTML content to extract from
Optional:
- precision (boolean, default: true): Favor precision over recall in extraction
- include_comments (boolean, default: false): Include HTML comments in output
- include_tables (boolean, default: true): Include tables in extracted content
- include_images (boolean, default: true): Include images in extracted content
- include_links (boolean, default: true): Include links in extracted content
- timeout (integer, default: 30): Request timeout in seconds for URL fetching (5-120)
- output_format (string, default: "markdown"): Output format - "markdown", "txt", or "xml"
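To make the parameter rules concrete, here is a standard-library sketch of the same defaults and bounds. The server itself uses Pydantic; the `ExtractArgs` class and the `to_trafilatura_kwargs` method are hypothetical, and mapping `precision` to Trafilatura's `favor_precision` keyword is an assumption about how the tool might forward its options.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlparse


@dataclass
class ExtractArgs:
    """Hypothetical stdlib stand-in for the tool's arguments."""

    url: Optional[str] = None
    html: Optional[str] = None
    precision: bool = True
    include_comments: bool = False
    include_tables: bool = True
    include_images: bool = True
    include_links: bool = True
    timeout: int = 30
    output_format: str = "markdown"

    def __post_init__(self) -> None:
        # At least one input source is required. What happens when both
        # are given is not stated in the README, so it is not checked here.
        if self.url is None and self.html is None:
            raise ValueError("provide 'url' or 'html'")
        if self.url is not None:
            if urlparse(self.url).scheme not in ("http", "https"):
                raise ValueError("url must use http or https")
        if not 5 <= self.timeout <= 120:
            raise ValueError("timeout must be between 5 and 120 seconds")
        if self.output_format not in ("markdown", "txt", "xml"):
            raise ValueError("output_format must be markdown, txt, or xml")

    def to_trafilatura_kwargs(self) -> Dict[str, Any]:
        """Assumed mapping onto keyword arguments of trafilatura.extract()."""
        return {
            "favor_precision": self.precision,
            "include_comments": self.include_comments,
            "include_tables": self.include_tables,
            "include_images": self.include_images,
            "include_links": self.include_links,
            "output_format": self.output_format,
        }
```

With a mapping like this, the extraction call could look like `trafilatura.extract(html, **args.to_trafilatura_kwargs())`, assuming a Trafilatura release that supports the "markdown" output format.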
Example Usage
Extracting from a URL:
{
"name": "extract_markdown",
"arguments": {
"url": "https://example.com/article",
"precision": true,
"include_tables": true,
"output_format": "markdown"
}
}
Extracting from HTML:
{
"name": "extract_markdown",
"arguments": {
"html": "<html><body><h1>Title</h1><p>Content...</p></body></html>",
"include_comments": false,
"output_format": "txt"
}
}
Minimal usage:
{
"name": "extract_markdown",
"arguments": {
"url": "https://news.ycombinator.com/"
}
}
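The examples above show only the tool name and arguments. On the wire, MCP wraps them in a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message; `build_tool_call` is a hypothetical helper, and a real session would first complete the MCP `initialize` handshake, which most client libraries handle for you.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, arguments: Dict[str, Any]) -> str:
    """Wrap tool arguments in a JSON-RPC 2.0 `tools/call` request."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_markdown", "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_tool_call(1, {"url": "https://example.com/article"}))
```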
Configuration for Popular MCP Clients
Claude Desktop
Add to your Claude Desktop configuration file:
Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Using uvx (recommended - no installation needed):
{
"mcpServers": {
"trafilatura": {
"command": "uvx",
"args": ["mcp-trafilatura-server"]
}
}
}
Or if installed via pip:
{
"mcpServers": {
"web-extractor": {
"command": "mcp-trafilatura-server",
"args": []
}
}
}
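If the configuration file already lists other servers, the entry has to be merged in rather than pasted over the file. A small sketch of that merge follows; `register_server` is a hypothetical helper, and the path you pass should be one of the locations listed above.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def register_server(
    config_path: Path, name: str, command: str, args: List[str]
) -> Dict[str, Any]:
    """Add or replace one entry under "mcpServers", keeping other entries."""
    config: Dict[str, Any] = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8").strip()
        config = json.loads(text) if text else {}
    # Create the "mcpServers" mapping if the file did not have one yet.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `register_server(path, "trafilatura", "uvx", ["mcp-trafilatura-server"])` would produce the uvx entry shown above. Restarting the client afterwards is typically needed for the change to take effect.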
VS Code with Continue
Add to your Continue configuration:
{
"mcpServers": [
{
"name": "trafilatura",
"command": "uvx",
"args": ["mcp-trafilatura-server"]
}
]
}
Development
Project Structure
mcp-trafilatura/
├── src/
│ └── trafilatura_mcp/
│ ├── __init__.py
│ ├── server.py # Main server implementation
│ └── py.typed # Type hints marker
├── pyproject.toml # Package configuration
├── requirements.txt # Dependencies
├── LICENSE # GPL-3.0 license
├── MANIFEST.in # Package data files
└── README.md # This file
Development Setup
# Clone the repository
git clone https://github.com/achieveai/mcp-web-extractor.git mcp-trafilatura
cd mcp-trafilatura
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Run type checking
mypy src/
# Run linting
ruff check src/
# Format code
black src/
isort src/
Testing
# Install test dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=trafilatura_mcp
Error Handling
The server provides comprehensive error handling:
- Input Validation: Invalid URLs, missing required parameters, or invalid parameter values
- Network Errors: Connection timeouts, HTTP errors, or unreachable URLs
- Extraction Errors: Empty content, parsing failures, or unsupported content types
- Server Errors: Internal errors with detailed logging
All errors are returned as proper MCP error responses with descriptive messages.
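As a rough illustration of the categories above, the sketch below maps Python exceptions to a category and packages the message as a tool result with `isError` set, which is one shape the MCP specification defines for tool failures. The `ExtractionError` class, the category labels, and the choice of which exceptions fall into which category are assumptions, not the server's actual implementation.

```python
import asyncio
from typing import Any, Dict


class ExtractionError(Exception):
    """Hypothetical: raised when no content could be extracted."""


def classify_error(exc: BaseException) -> str:
    """Map an exception to one of the four categories (assumed mapping)."""
    if isinstance(exc, ValueError):
        return "Input validation error"
    # TimeoutError and ConnectionError are subclasses of OSError.
    if isinstance(exc, (asyncio.TimeoutError, OSError)):
        return "Network error"
    if isinstance(exc, ExtractionError):
        return "Extraction error"
    return "Server error"


def to_tool_result(exc: BaseException) -> Dict[str, Any]:
    """Build a tool result that flags the failure with a readable message."""
    text = f"{classify_error(exc)}: {exc}"
    return {"content": [{"type": "text", "text": text}], "isError": True}
```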
Logging
The server uses Python's built-in logging with INFO level by default. Logs include:
- Server startup and shutdown
- URL fetching attempts
- Content extraction operations
- Error conditions with details
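A logging setup along these lines might look like the sketch below. It assumes a logger named after the `trafilatura_mcp` package and writes to stderr, since a stdio-based MCP server uses stdout for protocol messages; the function name and format string are illustrative only.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a package logger that writes to stderr at the given level."""
    logger = logging.getLogger("trafilatura_mcp")  # assumed logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.info("Server starting")
    log.debug("Not shown at the default INFO level")
```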
Dependencies
- modelcontextprotocol: MCP protocol implementation
- pydantic: Data validation and settings management
- trafilatura: Core content extraction functionality
- httpx: Async HTTP client for URL fetching
- typing-extensions: Additional type hints support
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0) because it directly imports and uses Trafilatura's Python API, which is GPL-3.0 licensed. See the LICENSE file for details.
Important: Since this server uses Trafilatura's Python API (Option B from docs), it constitutes a derivative work under GPL-3.0. If you need different licensing terms, consider:
- Using Option A (CLI subprocess approach) for more licensing flexibility
- Implementing your own extraction logic
- Contacting Trafilatura maintainers for commercial licensing
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Support
For issues, questions, or contributions, please visit the project repository or open an issue.
Changelog
v0.1.0
- Initial release
- Basic content extraction from URLs and HTML
- Support for multiple output formats
- Comprehensive error handling and validation
- Async implementation with timeout support