singlefile-mcp
An MCP server for intelligent web content extraction from JavaScript-heavy sites using single-file and trafilatura. It enables AI agents to fetch, render, and paginate through clean article content and metadata.
Single-File MCP Server
A powerful Model Context Protocol (MCP) server that provides intelligent web content extraction using single-file and trafilatura. Perfect for AI agents that need to access and analyze web content from JavaScript-heavy sites.
GitHub Repository: https://github.com/kwinsch/singlefile-mcp
Features
🌐 Universal Web Content Access
- JavaScript Support: Handles modern SPA/React/Vue apps that require browser rendering
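- Clean Content Extraction: Uses trafilatura to isolate the main article text from navigation, ads, and other page chrome (trafilatura applies its own heuristics, with Readability-style fallbacks)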
- Rich Metadata: Extracts title, author, date, description, and more
- Multiple Output Formats: Raw HTML or clean markdown-like content
📄 Smart Pagination & Token Management
- Flexible Pagination: Offset/limit system like file reading tools
- Token Limits: Configurable max tokens (up to 25,000)
- Smart Truncation: Summary mode shows beginning + end, truncate mode cuts cleanly
- Navigation Hints: Clear guidance on how to continue reading large documents
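The difference between the two truncation modes can be illustrated with a minimal sketch. It works on a character budget rather than real tokens, and the function name `shorten` and the omission marker are hypothetical; the server's own implementation may differ.

```python
def shorten(text: str, max_chars: int, method: str = "truncate") -> str:
    """Shorten text to at most max_chars characters (illustrative only)."""
    if len(text) <= max_chars:
        return text
    if method == "truncate":
        # Truncate mode: keep the beginning and drop everything past the budget.
        return text[:max_chars]
    if method == "summary":
        # Summary mode: keep the beginning and the end, drop the middle.
        marker = "\n[... middle omitted ...]\n"  # hypothetical marker text
        keep = max_chars - len(marker)
        if keep <= 0:
            # Budget too small to fit the marker; fall back to a plain cut.
            return text[:max_chars]
        tail = keep // 2
        head = keep - tail
        return text[:head] + marker + (text[-tail:] if tail else "")
    raise ValueError(f"unknown truncate_method: {method!r}")
```

Summary mode is useful when a document's conclusion matters as much as its introduction; truncate mode is the better fit when you plan to keep reading with offset/limit.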
⚡ Performance & Control
- Selective Loading: Block images/scripts for faster processing
- Content Compression: Optional HTML compression
- Timeout Protection: Configurable timeouts prevent hanging
- Error Handling: Graceful degradation when extraction fails
Installation
Prerequisites
- Python 3.8+
- single-file CLI - Web page capture tool
- Node.js 16+ (for single-file)
- A supported browser (Chromium, Chrome, Edge, Firefox, etc.)
Install single-file CLI
The single-file CLI is essential for this MCP server to work. It uses a real browser engine to accurately capture JavaScript-rendered content.
npm install -g single-file-cli
Usage with Claude Code
Quick Install (from PyPI)
claude mcp add singlefile-mcp -s user -- uvx singlefile-mcp
This runs the package straight from PyPI through uvx, with no separate install step, much like npx does for the Brave Search server.
Development Install (from local directory)
claude mcp add singlefile-mcp -s user -- uvx --from /path/to/single-file_mcp singlefile-mcp
Remove old server (if upgrading)
claude mcp remove single-file-fetcher --scope user
Optional: Add Brave Search MCP
claude mcp add brave-search -s user -- env BRAVE_API_KEY=YOUR_KEY npx -y @modelcontextprotocol/server-brave-search
API Reference
fetch_webpage
Fetch and process web content with intelligent extraction.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL of the webpage to fetch |
| output_content | boolean | true | Whether to return content in response |
| extract_content | boolean | false | Extract clean text content (recommended) |
| include_metadata | boolean | true | Include page metadata (title, author, etc.) |
| block_images | boolean | false | Block image downloads for faster processing |
| block_scripts | boolean | true | Block JavaScript execution |
| compress_html | boolean | true | Compress HTML output |
| max_tokens | number | 20000 | Maximum tokens in response (max: 25000) |
| truncate_method | string | "truncate" | How to handle large content: "truncate" or "summary" |
| offset | number | 0 | Character offset to start reading from |
| limit | number | null | Maximum characters to return |
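For client code that builds requests programmatically, the parameters and defaults above could be mirrored in a small dataclass. This is only a sketch: the class name `FetchOptions` is hypothetical, and clamping `max_tokens` to 25000 is an assumption (the server might reject larger values instead).

```python
from dataclasses import dataclass
from typing import Optional

MAX_TOKENS_CAP = 25000  # upper bound stated in the parameter table


@dataclass
class FetchOptions:
    """Hypothetical client-side mirror of the fetch_webpage parameters."""
    url: str
    output_content: bool = True
    extract_content: bool = False
    include_metadata: bool = True
    block_images: bool = False
    block_scripts: bool = True
    compress_html: bool = True
    max_tokens: int = 20000
    truncate_method: str = "truncate"
    offset: int = 0
    limit: Optional[int] = None

    def __post_init__(self) -> None:
        if self.truncate_method not in ("truncate", "summary"):
            raise ValueError('truncate_method must be "truncate" or "summary"')
        if self.offset < 0:
            raise ValueError("offset must be >= 0")
        if self.limit is not None and self.limit < 0:
            raise ValueError("limit must be >= 0 or None")
        # Assumption: values above the documented maximum are clamped.
        self.max_tokens = min(self.max_tokens, MAX_TOKENS_CAP)
```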
Examples
Basic content extraction:
fetch_webpage(
url="https://example.com/article",
extract_content=True,
include_metadata=True
)
Paginated reading of large documents:
# Get overview
fetch_webpage(
url="https://docs.example.com/guide",
extract_content=True,
limit=5000
)
# Continue reading from offset
fetch_webpage(
url="https://docs.example.com/guide",
extract_content=True,
offset=5000,
limit=5000
)
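The two calls above generalize to a loop that keeps advancing the offset until the document is exhausted. The sketch below assumes a `fetch` callable that wraps the `fetch_webpage` tool call and returns only the extracted text for a given offset and limit; that wrapper, and the name `read_all`, are hypothetical.

```python
from typing import Callable, Iterator, Optional


def read_all(
    fetch: Callable[[int, int], str],
    chunk_size: int = 5000,
    max_chunks: Optional[int] = None,
) -> Iterator[str]:
    """Yield successive chunks until a short or empty chunk signals the end."""
    offset = 0
    count = 0
    while max_chunks is None or count < max_chunks:
        chunk = fetch(offset, chunk_size)
        if not chunk:
            break  # nothing left to read
        yield chunk
        count += 1
        if len(chunk) < chunk_size:
            break  # last, partial chunk
        offset += len(chunk)


# Stand-in for the real tool call: slices an in-memory document.
document = "x" * 12000


def fake_fetch(offset: int, limit: int) -> str:
    return document[offset:offset + limit]


print([len(c) for c in read_all(fake_fetch)])  # [5000, 5000, 2000]
```

Passing `max_chunks` caps how much of a very long page an agent pulls in before deciding whether to continue.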
Raw HTML for complex parsing:
fetch_webpage(
url="https://app.example.com/dashboard",
extract_content=False,
block_scripts=False,
max_tokens=15000
)
Practical Example: Research Workflow
Here's a real-world example combining Brave Search and Single-File MCP:
Step 1: Search for information
# Using Brave Search MCP
brave_web_search(
query="artificial intelligence history timeline",
count=5
)
Step 2: Fetch and analyze Wikipedia article
# Using Single-File MCP to extract content
fetch_webpage(
url="https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
extract_content=True,
include_metadata=True,
limit=5000 # Get first 5000 chars
)
Result:
Successfully fetched webpage: https://en.wikipedia.org/wiki/History_of_artificial_intelligence
## Metadata
**Title:** History of artificial intelligence - Wikipedia
**Description:** The history of artificial intelligence (AI) began in antiquity...
**Site:** wikipedia.org
## Extracted Content (chars 0-5000 of 45000)
*Note: More content available. Use offset=5000 to continue.*
# History of artificial intelligence
The history of artificial intelligence (AI) began in antiquity, with myths,
stories and rumors of artificial beings endowed with intelligence...
[Clean, readable article content follows...]
Step 3: Continue reading with pagination
# Get next section
fetch_webpage(
url="https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
extract_content=True,
offset=5000,
limit=5000
)
This workflow enables AI agents to:
- Search for current information beyond their training data
- Extract clean, structured content from any webpage
- Process JavaScript-heavy sites that plain HTTP fetchers cannot render
- Paginate through long documents intelligently
Output Format
With Content Extraction
Successfully fetched webpage: https://example.com
## Metadata
**Title:** Example Article
**Author:** John Doe
**Date:** 2024-01-15
**Description:** An informative article about...
**Site:** example.com
## Extracted Content (chars 0-5000 of 12000)
*Note: More content available. Use offset=5000 to continue.*
# Article Title
This is the clean, readable content extracted from the webpage...
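As a rough illustration of how a response in this shape could be assembled, the sketch below builds the same layout from a metadata dictionary and the extracted text. The function name `render_response` and the dictionary keys are assumptions made for the example, not the server's actual code.

```python
from typing import Dict, Optional


def render_response(
    url: str,
    metadata: Dict[str, str],
    content: str,
    offset: int = 0,
    limit: Optional[int] = None,
) -> str:
    """Assemble a response in the layout shown above (illustrative only)."""
    total = len(content)
    start = min(max(offset, 0), total)
    end = total if limit is None else min(start + limit, total)

    lines = [f"Successfully fetched webpage: {url}", "## Metadata"]
    for key in ("Title", "Author", "Date", "Description", "Site"):
        if metadata.get(key):  # skip fields the page did not provide
            lines.append(f"**{key}:** {metadata[key]}")

    lines.append(f"## Extracted Content (chars {start}-{end} of {total})")
    if end < total:
        lines.append(f"*Note: More content available. Use offset={end} to continue.*")
    lines.append(content[start:end])
    return "\n".join(lines)
```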
Pagination Info
When using offset/limit, responses include:
- Current position: chars 1000-6000 of 12000
- Navigation hint: Use offset=6000 to continue
- Total size information
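A client can read the position line to decide where to resume. The following sketch assumes the position always appears as `chars START-END of TOTAL`, as in the examples in this README; the names `PageInfo` and `parse_position` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class PageInfo(NamedTuple):
    start: int
    end: int
    total: int

    @property
    def next_offset(self) -> Optional[int]:
        """Offset for the next request, or None when the end was reached."""
        return self.end if self.end < self.total else None


_POSITION = re.compile(r"chars (\d+)-(\d+) of (\d+)")


def parse_position(response: str) -> Optional[PageInfo]:
    """Find the first 'chars A-B of N' marker in a response, if any."""
    match = _POSITION.search(response)
    if match is None:
        return None
    start, end, total = (int(g) for g in match.groups())
    return PageInfo(start, end, total)
```

For example, parsing a response that contains `chars 1000-6000 of 12000` gives a `next_offset` of 6000, matching the navigation hint.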
Use Cases
📚 Documentation Analysis
Perfect for reading large technical docs, API references, and guides that span multiple pages.
📰 News & Article Processing
Extract clean article content from news sites, blogs, and publications for analysis.
🔍 Research & Data Gathering
Gather structured data from websites, including metadata and clean text content.
🤖 AI Agent Integration
Enable AI agents to browse and understand web content, even from JavaScript-heavy applications.
⚖️ Legal Document Processing
Handle complex legal documents and government sites that require JavaScript rendering.
Technical Details
Content Extraction Pipeline
- single-file: Renders JavaScript and saves complete webpage
- trafilatura: Extracts the main content using its own heuristics, with Readability-style fallbacks
- Pagination: Applies offset/limit for manageable chunks
- Token Management: Ensures responses fit within LLM context limits
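The first three stages can be sketched as follows. This is a simplified illustration, not the server's code: it uses only the basic `single-file <url> <output>` invocation and `trafilatura.extract`, leaves out any extra CLI options and the token-limit step, and the function names are hypothetical. The capture and extraction steps are injectable so the slicing logic can be exercised without a browser.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional


def capture_html(url: str, timeout: float = 60.0) -> str:
    """Render the page with the single-file CLI and return the saved HTML."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "page.html"
        # Basic CLI form: single-file <url> <output>; the timeout guards against hangs.
        subprocess.run(["single-file", url, str(out_path)], check=True, timeout=timeout)
        return out_path.read_text(encoding="utf-8")


def extract_text(html: str) -> str:
    """Extract the main text with trafilatura (third-party, imported lazily)."""
    import trafilatura

    return trafilatura.extract(html) or ""  # extract() returns None on failure


def fetch_clean(
    url: str,
    offset: int = 0,
    limit: Optional[int] = None,
    capture: Callable[[str], str] = capture_html,
    extract: Callable[[str], str] = extract_text,
) -> str:
    """Capture, extract, then apply offset/limit to the extracted text."""
    text = extract(capture(url))
    end = len(text) if limit is None else offset + limit
    return text[offset:end]
```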
Browser Engine
Uses a browser via single-file for full JavaScript support:
- Works with any supported browser installed on your system
- Waits for network idle before capture
- Removes hidden elements and unused styles
- Handles dynamic content loading
Metadata Extraction
Automatically extracts:
- Page title and description
- Author and publication date
- Site name and language
- Categories and tags (when available)
Error Handling
- Network Issues: Graceful timeout with informative errors
- JavaScript Errors: Continues processing even if some scripts fail
- Large Content: Automatic truncation with clear indicators
- Invalid URLs: Clear validation error messages
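URL validation of the kind described in the last item might look like the sketch below, which accepts only absolute http and https URLs. The function name `validate_url` and the exact error messages are assumptions for illustration.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the URL unchanged if it is an absolute http(s) URL, else raise."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        raise ValueError(
            f"Unsupported or missing URL scheme in {url!r}; expected http or https"
        )
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    return parsed.geturl()
```

Rejecting bad input before launching a browser keeps the error message specific and avoids waiting for a timeout on something that could never load.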
Development Setup
- Clone the repository:
git clone https://github.com/kwinsch/singlefile-mcp.git
cd singlefile-mcp
- Install dependencies:
pip install -r requirements.txt
- Install in development mode:
pip install -e .
- Test locally with Claude Code:
claude mcp add singlefile-mcp -s user -- uvx --from . singlefile-mcp
License
MIT License - see LICENSE file for details.
Dependencies
- single-file - Core web page capture tool that handles JavaScript rendering
- trafilatura - Main-content and metadata extraction from the captured HTML
- mcp - Model Context Protocol for AI integration
Acknowledgments
- single-file by Gildas Lormeau - Excellent web page capture tool
- trafilatura - Robust content extraction library
- Model Context Protocol - Standardized AI integration protocol