Fetch as Markdown
Fetches web pages and converts them to clean, readable markdown format by extracting main content while removing navigation, ads, and other non-essential elements to minimize token usage.
Fetch as Markdown MCP Server
A Model Context Protocol (MCP) server that fetches web pages and converts them to clean, readable markdown format, focusing on main content extraction while minimizing context overhead.
Overview
This MCP server acts as a bridge between AI assistants and the web, specifically designed to:
- Extract Clean Content: Focuses on main article content, removing navigation, ads, and sidebars
- Minimize Context: Strips unnecessary elements to reduce token usage while preserving content structure
- Respectful Scraping: Implements proper rate limiting, user-agent headers, and timeout handling
- Error Resilience: Gracefully handles various web-related errors and edge cases
Features
Web Page Fetching
- Fetch any publicly accessible web page
- Automatic redirect handling with final URL reporting
- Configurable timeouts and proper error handling
- Respectful rate limiting (1-second intervals between requests)
Content Cleaning
- Removes navigation, ads, sidebars, and other non-essential elements
- Focuses on main content areas using semantic HTML detection
- Strips unnecessary HTML attributes to reduce token usage
- Preserves content structure and readability
Markdown Conversion
- Converts HTML to clean, readable markdown
- Configurable link and image inclusion
- Proper heading hierarchy and formatting
- Post-processing to remove excessive whitespace
Installation & Setup
Prerequisites
- Python 3.12 or higher
- uv package manager
Quick Start
Run directly with uvx:
uvx git+https://github.com/bhubbb/mcp-fetch-as-markdown
Or install locally:
- Clone or download this project
- Install dependencies:
cd mcp-fetch-as-markdown
uv sync
- Run the server:
uv run python main.py
Integration with AI Assistants
This MCP server is designed to work with AI assistants that support the Model Context Protocol. Configure your AI assistant to connect to this server via stdio.
Example configuration for Claude Desktop:
{
  "mcpServers": {
    "fetch-as-markdown": {
      "command": "uvx",
      "args": ["git+https://github.com/bhubbb/mcp-fetch-as-markdown"]
    }
  }
}
Or if using a local installation:
{
  "mcpServers": {
    "fetch-as-markdown": {
      "command": "uv",
      "args": ["run", "python", "/path/to/mcp-fetch-as-markdown/main.py"]
    }
  }
}
Available Tools
fetch
Fetch a web page and convert it to clean markdown format.
Parameters:
- url (required): URL of the web page to fetch and convert
- include_links (optional): Whether to preserve links in markdown output (default: true)
- include_images (optional): Whether to include image references (default: false)
- timeout (optional): Request timeout in seconds (5-30, default: 10)
Returns:
- Fetch metadata (original URL, final URL, title, content length, status code, content type)
- Clean markdown content with proper formatting
Example:
{
  "name": "fetch",
  "arguments": {
    "url": "https://example.com/article",
    "include_links": true,
    "include_images": false,
    "timeout": 15
  }
}
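The parameter rules above can be sketched as a small normalization step. This is a minimal illustration, not the server's actual code: the function name `normalize_fetch_args` is hypothetical, and clamping an out-of-range timeout (rather than rejecting it) is an assumption.

```python
def normalize_fetch_args(arguments: dict) -> dict:
    """Apply the documented defaults and limits to a `fetch` call (hypothetical helper)."""
    url = arguments.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("'url' is required")

    # Documented range is 5-30 seconds with a default of 10.
    # Clamping is an assumption; the real server may reject the value instead.
    timeout = int(arguments.get("timeout", 10))
    timeout = max(5, min(30, timeout))

    return {
        "url": url.strip(),
        "include_links": bool(arguments.get("include_links", True)),
        "include_images": bool(arguments.get("include_images", False)),
        "timeout": timeout,
    }


args = normalize_fetch_args({"url": "https://example.com/article", "timeout": 60})
print(args)
```

Only `url` has to be supplied; every other key falls back to the default listed under Parameters.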
How It Works
Content Extraction Strategy
- Fetch Page: Makes HTTP request with proper headers and timeout handling
- Parse HTML: Uses BeautifulSoup to parse the HTML content
- Remove Unwanted Elements: Strips scripts, styles, navigation, ads, sidebars, footers
- Find Main Content: Looks for semantic elements like <main>, <article>, or common content classes
- Clean Attributes: Removes unnecessary HTML attributes to reduce size
- Convert to Markdown: Uses configurable markdown conversion with proper formatting
- Post-process: Removes excessive whitespace and blank lines
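The removal and post-processing steps above can be illustrated with a dependency-free sketch. Note the substitution: the server uses BeautifulSoup and markdownify, while this example uses the standard library's `html.parser` so it runs without installing anything. The class name, the helper name, and the exact tag list are assumptions, and main-content selection is not implemented here.

```python
import re
from html.parser import HTMLParser

# Tags whose entire subtree is dropped; this list is illustrative.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text, skipping everything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in {"h1", "h2", "h3"}:
            # Emit a markdown heading marker matching the heading level.
            self.chunks.append("\n\n" + "#" * int(tag[1]) + " ")
        elif self.skip_depth == 0 and tag == "p":
            self.chunks.append("\n\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Post-process: trim trailing spaces and collapse runs of blank lines.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


page = """
<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Title</h1><p>First paragraph.</p><script>track()</script></main>
<footer>Copyright</footer>
</body></html>
"""
print(html_to_text(page))
```

The navigation link, the script body, and the footer are dropped, leaving only the heading and the paragraph.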
Respectful Web Scraping
- Rate Limiting: Minimum 1-second interval between requests
- User Agent: Proper identification as "MCP-Fetch-As-Markdown" tool
- Timeout Handling: Configurable timeouts to avoid hanging requests
- Error Handling: Graceful handling of network issues, HTTP errors, and malformed content
- Redirect Support: Follows redirects and reports final URLs
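One way to enforce a minimum interval between requests is sketched below. The `RateLimiter` class and its injectable `clock` and `sleep` parameters are hypothetical; the server's own implementation in main.py may differ.

```python
import time


class RateLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # no request made yet

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time after any sleep so the next interval starts here.
        self._last_request = self._clock()
        return waited


limiter = RateLimiter(min_interval=1.0)
first = limiter.wait()   # first call does not wait
second = limiter.wait()  # second call sleeps for the rest of the interval
print(f"waited {first:.2f}s, then {second:.2f}s")
```

Calling `wait()` immediately before each HTTP request keeps consecutive requests at least one second apart. Using a monotonic clock avoids surprises if the system time changes.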
Structured Output Format
All responses include:
- Metadata Block: Original URL, final URL, page title, content statistics, HTTP status
- Content Block: Clean markdown conversion of the main page content
This structure makes responses both human-readable and machine-parseable while minimizing token usage.
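As an illustration of the two-block structure, the sketch below renders a metadata block followed by the markdown content. The `FetchResult` dataclass, the field labels, and the `---` separator are assumptions made for this example; the server's actual layout may differ.

```python
from dataclasses import dataclass


@dataclass
class FetchResult:
    original_url: str
    final_url: str
    title: str
    status_code: int
    content_type: str
    markdown: str


def format_response(result: FetchResult) -> str:
    """Render a metadata block followed by the content block (layout is assumed)."""
    metadata = [
        f"Original URL: {result.original_url}",
        f"Final URL: {result.final_url}",
        f"Title: {result.title}",
        f"Content Length: {len(result.markdown)} characters",
        f"Status Code: {result.status_code}",
        f"Content Type: {result.content_type}",
    ]
    return "\n".join(metadata) + "\n\n---\n\n" + result.markdown


result = FetchResult(
    original_url="https://example.com/article",
    final_url="https://example.com/article/",
    title="Example Article",
    status_code=200,
    content_type="text/html",
    markdown="# Example Article\n\nBody text.",
)
response = format_response(result)
print(response)
```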
Error Handling
- Invalid URLs: Clear validation and error messages
- Network Issues: Timeout, connection error, and DNS failure handling
- HTTP Errors: Proper handling of 404, 500, and other HTTP status codes
- Malformed Content: Graceful handling of broken HTML and encoding issues
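URL validation of the kind listed above could look like the following sketch, which uses only the standard library. The function names and error messages are hypothetical, and restricting schemes to http and https is an assumption.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the stripped URL, or raise ValueError with a readable message."""
    candidate = url.strip()
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported or missing URL scheme: {candidate!r}")
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {candidate!r}")
    return candidate


def check(url: str) -> str:
    """Turn validation failures into a short error string instead of raising."""
    try:
        return f"ok: {validate_url(url)}"
    except ValueError as exc:
        return f"error: {exc}"


for url in ("https://example.com/article", "example.com/article", "https:///path"):
    print(check(url))
```

Rejecting bad input before any network call keeps the error message specific and avoids spending a rate-limited request on a URL that cannot succeed.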
Use Cases
For Research & Analysis
- Convert articles and blog posts to clean markdown for analysis
- Extract main content from news articles and research papers
- Gather information while minimizing irrelevant context
For Content Processing
- Prepare web content for further AI processing
- Extract clean text from web pages for summarization
- Convert HTML content to markdown for documentation
For AI Assistants
- Fetch and process web content with minimal token overhead
- Extract relevant information while filtering out noise
- Provide clean, structured content for AI reasoning
Examples
Basic Page Fetching
Ask your AI assistant: "Fetch the content from this article URL as markdown"
The server will:
- Fetch the web page with proper headers and rate limiting
- Extract the main content area, removing navigation and ads
- Convert to clean markdown format
- Return structured metadata and content
With Link Preservation
Ask your AI assistant: "Fetch this page but keep all the links intact"
The server will:
- Fetch and process the page normally
- Preserve all hyperlinks in markdown format [text](url)
- Maintain link structure while cleaning other elements
Error Handling Example
Ask your AI assistant: "Try to fetch content from this broken URL"
The server will:
- Validate the URL format
- Attempt the request with proper timeout
- Return a structured error message if the request fails
- Provide helpful information about what went wrong
Development
Project Structure
mcp-fetch-as-markdown/
├── main.py          # Main MCP server implementation
├── pyproject.toml   # Project dependencies and metadata
├── AGENT.md         # Development rules and guidelines
├── example.py       # Usage examples and demonstrations
└── .venv/           # Virtual environment (created by uv)
Key Dependencies
- mcp: Model Context Protocol framework
- requests: HTTP request handling
- beautifulsoup4: HTML parsing and content extraction
- markdownify: HTML to markdown conversion
Customization
The server can be easily customized by modifying main.py:
- Content Selectors: Modify the CSS selectors used to find main content
- Rate Limiting: Adjust the minimum interval between requests
- Timeout Settings: Change default and maximum timeout values
- Content Filtering: Add custom content processing or filtering rules
- Markdown Options: Customize markdown conversion settings
Testing the Server
Test the server directly:
uvx git+https://github.com/bhubbb/mcp-fetch-as-markdown
Or with local installation:
cd mcp-fetch-as-markdown
uv run python main.py
For interactive testing, use the example script:
uv run python example.py
Troubleshooting
Common Issues
- Import Errors: Make sure all dependencies are installed with uv sync
- Connection Timeouts: Some websites may be slow; try increasing the timeout parameter
- Rate Limiting: The server enforces 1-second intervals between requests
- Blocked Requests: Some websites may block automated requests; this is expected behavior
Debugging
Enable debug logging by modifying the logging level in main.py:
logging.basicConfig(level=logging.DEBUG)
Website Compatibility
- Modern Websites: Works best with standard HTML structure
- JavaScript-heavy Sites: Cannot execute JavaScript; fetches initial HTML only
- Protected Content: Respects robots.txt and website access restrictions
- Rate Limits: Enforces a minimum 1-second interval between requests
Ethical Usage
This tool is designed for legitimate research, analysis, and content processing. Please:
- Respect Terms of Service: Always check and comply with website terms of service
- Avoid Overloading: The built-in rate limiting helps, but be mindful of request frequency
- Attribution: Give proper credit to original sources when using extracted content
- Legal Compliance: Ensure your use case complies with applicable laws and regulations
Contributing
This is a simple, single-file implementation designed for clarity and ease of modification. Feel free to:
- Add support for additional content extraction strategies
- Implement custom filtering for specific website types
- Add caching for better performance
- Extend with additional markdown formatting options
License
This project uses the same license as its dependencies. Content fetched from websites remains subject to the original website's terms of service and copyright.