fetch-guard
Fetch URLs and return clean, LLM-ready markdown with metadata and layered prompt injection defense. Configurable timeouts, word limits, JS rendering, and link extraction. All-in-one MCP server + CLI.
Fetch Guard
<a href="https://glama.ai/mcp/servers/@Erodenn/fetch-guard"> <img width="380" height="200" src="https://glama.ai/mcp/servers/@Erodenn/fetch-guard/badge" alt="fetch-guard MCP server" /> </a>
An MCP server and CLI tool that fetches URLs and returns clean, LLM-ready markdown. A purpose-built extraction pipeline sanitizes HTML, pulls structured metadata, detects prompt injection attempts, and handles the edge cases that break naive fetchers: bot blocks, paywalls, login walls, non-HTML content types, and pages that require JavaScript to render.
The core problem is straightforward: LLMs need web content, but raw HTML is noisy and potentially hostile. Fetched pages can contain hidden text, invisible Unicode, off-screen elements, and outright prompt injection attempts embedded in the content itself. This pipeline strips all of that before the content reaches the model.
Three layers handle the injection defense specifically:
- Pre-extraction sanitization removes hidden elements (`display:none`, `visibility:hidden`, `opacity:0`), elements hidden via CSS class/ID rules in `<style>` tags, off-screen positioned content, `aria-hidden` elements, `<noscript>` tags, and 26 categories of non-printing Unicode characters including bidi isolates and Unicode Tags. This happens before content extraction, so trafilatura never sees the attack vectors.
- Pattern scanning runs a three-phase scan against the extracted text. Phase one applies 14 compiled regex patterns covering system prompt overrides, ignore-previous instructions, role injection, fake conversation tags, and hidden instruction markers. Phase two normalizes the text via NFKC and confusable-character mapping, then rescans to catch homoglyph bypasses (Cyrillic characters substituted for Latin, etc.). Phase three finds base64 and hex encoded blocks, decodes them, and scans the decoded content with high-severity patterns.
- Session-salted output wrapping generates a random 8-character hex salt per invocation and wraps the body in `<fetch-content-{salt}>` tags. Since the salt is unpredictable, injected content cannot spoof the wrapper boundaries.
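The three-phase scan described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the two regexes, the `CONFUSABLES` table, and the function names are all hypothetical stand-ins (the real package ships 14 patterns that are not reproduced here), and the hex-decoding branch of phase three is omitted for brevity.

```python
import base64
import binascii
import re
import unicodedata

# Illustrative patterns only (hypothetical); the real pattern set is larger.
PATTERNS = {
    "ignore_previous": (
        re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
        "high",
    ),
    "fake_system_tag": (re.compile(r"<\s*/?\s*system\s*>", re.I), "medium"),
}

# Tiny hypothetical confusables table: Cyrillic look-alikes mapped to Latin.
CONFUSABLES = str.maketrans(
    {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}
)

# Runs of base64-alphabet characters long enough to be worth decoding.
BASE64_BLOCK = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def _scan(text, phase, only_high=False):
    """Return (phase, pattern_name, severity) for every pattern that matches."""
    hits = []
    for name, (pattern, severity) in PATTERNS.items():
        if only_high and severity != "high":
            continue
        if pattern.search(text):
            hits.append((phase, name, severity))
    return hits


def scan(text):
    # Phase one: scan the text exactly as extracted.
    hits = _scan(text, "raw")

    # Phase two: NFKC plus confusable mapping, then rescan for new matches.
    normalized = unicodedata.normalize("NFKC", text).translate(CONFUSABLES)
    if normalized != text:
        seen = {name for _, name, _ in hits}
        hits += [h for h in _scan(normalized, "normalized") if h[1] not in seen]

    # Phase three: decode base64-looking blocks, scan with high-severity patterns.
    for block in BASE64_BLOCK.findall(text):
        try:
            decoded = base64.b64decode(block, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue  # not valid base64 text, ignore
        hits += _scan(decoded, "decoded", only_high=True)
    return hits
```

Rescanning after normalization is what catches a Cyrillic "о" swapped into "ignore", which the raw regex pass would miss.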
One Tool
This is a single-tool MCP server. It exposes one tool — fetch — that runs a full extraction pipeline behind a consistent interface. No tool selection, no routing, no multi-step workflows. One URL in, one structured result out, configurable via parameters.
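For orientation, here is a sketch of what a call to the `fetch` tool looks like at the protocol level, assuming the standard MCP `tools/call` JSON-RPC request shape. The helper name `make_fetch_request` is hypothetical; the argument names come from the Tool Parameters table in this README. In practice an MCP client library builds this message for you.

```python
import json

# Parameter names as listed in this README's Tool Parameters table.
ALLOWED_OPTIONS = {"timeout", "max_words", "strict", "js", "links"}


def make_fetch_request(url, request_id=1, **options):
    """Build a JSON-RPC 2.0 'tools/call' request for the single 'fetch' tool."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "fetch", "arguments": {"url": url, **options}},
    }


if __name__ == "__main__":
    request = make_fetch_request("https://example.com", max_words=500, links="full")
    print(json.dumps(request, indent=2))
```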
Quick Start
Prerequisites
- Python 3.10+
- pip
Install
pip install fetch-guard
For JavaScript rendering (optional):
pip install 'fetch-guard[js]' && playwright install chromium
Configure Your MCP Client
Add the following to your MCP client config. Works with Claude Code, Claude Desktop, Cursor, or any MCP-compatible client.
Via uvx (recommended):
{
"mcpServers": {
"fetch-guard": {
"command": "uvx",
"args": ["fetch-guard"]
}
}
}
Via pip install:
{
"mcpServers": {
"fetch-guard": {
"command": "fetch-guard"
}
}
}
From source:
{
"mcpServers": {
"fetch-guard": {
"command": "python",
"args": ["-m", "fetch_guard.server"]
}
}
}
Via Docker:
{
"mcpServers": {
"fetch-guard": {
"command": "docker",
"args": ["run", "-i", "--rm", "sterlsnyc/fetch-guard"]
}
}
}
Note: The Docker image does not include Playwright. JavaScript rendering (`js: true`) is not available when running via Docker. Use the `uvx` or `pip` install if you need JS rendering.
Verify
Ask your AI assistant to fetch any URL. If it returns structured content with a status header, metadata, and risk assessment, you're connected.
CLI
fetch-guard-cli <url> [options]
# or: python -m fetch_guard.cli <url> [options]
| Flag | Default | Description |
|---|---|---|
| `--timeout N` | 180 | Request timeout in seconds |
| `--max-words N` | none | Word cap on extracted body content |
| `--js` | off | Use Playwright for JS-rendered pages |
| `--strict` | off | Exit code 2 on high-risk injection |
| `--links MODE` | `domains` | `domains` for unique external domains, `full` for all URLs with anchor text |
Tool Parameters
The MCP fetch tool accepts these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | The URL to fetch |
| `timeout` | integer | 180 | Request timeout in seconds. Ensures the tool always returns, with no hanging fetches |
| `max_words` | integer | none | Word cap on extracted body content |
| `strict` | boolean | false | When true and high-risk injection is detected, the response is marked as an error |
| `js` | boolean | false | Use Playwright for JavaScript-rendered pages (requires `fetch-guard[js]`) |
| `links` | string | `"domains"` | `"domains"` for unique external domains, `"full"` for all URLs with anchor text |
Claude Code Skill
Copy resources/fetch-guard/ to .claude/skills/fetch-guard/ in your project, or use the standalone command file resources/fetch-guard.md as a Claude Code command.
What It Does
The pipeline runs a 13-step sequence from URL to structured output:
1. `/llms.txt` preflight. Checks the domain root for `/llms.txt` before the full fetch. If the requested URL is a domain root and `/llms.txt` exists, that content replaces the normal HTML pipeline entirely. This respects the emerging convention for LLM-friendly site summaries.
2. Fetch. Static HTTP request via `requests`, or Playwright-driven browser rendering if `--js` is set. No automatic fallback between the two: `--js` is explicit opt-in.
3. Edge detection. Classifies the response for bot blocks (Cloudflare challenges, 403/429/503 with block signatures, LinkedIn's custom 999), paywalls (subscription prompts, premium overlays), and login walls (sign-in redirects, members-only patterns).
4. Automatic retry. Bot blocks trigger one retry with a full Chrome User-Agent string before reporting. Paywalls and login walls are reported immediately with no retry.
5. Content-type routing. Non-HTML responses get a fast path: JSON is rendered as a fenced code block, RSS/Atom feeds are parsed into structured summaries, CSV becomes a markdown table (capped at 2,000 rows), and plain text passes through directly. Binary content types are rejected.
6. HTML sanitization. Strips hidden elements, off-screen positioned content, `aria-hidden` nodes, `<noscript>` tags, and non-printing Unicode. Returns a tally of everything removed.
7. Content extraction. trafilatura converts sanitized HTML to markdown with link preservation.
8. Metadata extraction. Pulls title, author, date, description, canonical URL, and image from three sources in priority order: JSON-LD, Open Graph, then meta tags.
9. Link extraction. Two modes: `domains` returns a sorted list of unique external domains, `full` returns all external URLs grouped by domain with anchor text.
10. Injection scanning. Three-phase scan: original text against all 14 patterns, NFKC-normalized text for homoglyph bypasses, and decode-and-scan for base64/hex encoded payloads. Each match records the pattern name, severity (high/medium), and a 60-character context snippet.
11. Truncation. If `--max-words` is set, the body is truncated after extraction but before output wrapping.
12. Salt wrapping. The body gets wrapped in session-salted tags for defense-in-depth.
13. Output formatting. CLI produces five plaintext sections (status header, body, metadata, links, injection details). MCP server returns a structured JSON dict with the same data.
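To make the content-type routing step (step 5) more concrete, here is a rough sketch of a non-HTML fast path. The function name `route_non_html`, the exact MIME types checked, and the choice to count the 2,000-row cap over data rows (excluding the header) are assumptions for illustration. RSS/Atom handling and binary rejection are left out.

```python
import csv
import io
import json

MAX_CSV_ROWS = 2000  # cap stated in this README; header row assumed not counted
FENCE = "`" * 3  # built at runtime so this example can sit inside a code fence


def route_non_html(content_type, text):
    """Return markdown for supported non-HTML types, or None to fall through."""
    mime = content_type.split(";")[0].strip().lower()

    if mime == "application/json":
        # Pretty-print and wrap in a fenced json block.
        pretty = json.dumps(json.loads(text), indent=2)
        return f"{FENCE}json\n{pretty}\n{FENCE}"

    if mime == "text/csv":
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            return ""
        header, data = rows[0], rows[1 : MAX_CSV_ROWS + 1]
        lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
        lines += ["| " + " | ".join(row) + " |" for row in data]
        return "\n".join(lines)

    if mime == "text/plain":
        return text  # plain text passes through unchanged

    return None  # HTML (or anything else) goes to the regular pipeline
```

A real implementation would also need to escape `|` characters inside CSV cells; this sketch does not.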
Output
CLI
Five sections, printed to stdout:
- Status header: URL, fetch timestamp, risk flag (`OK` or `INJECTION WARNING`), sanitization tally, edge case info if detected
- Body: clean markdown wrapped in `<fetch-content-{salt}>` tags
- Metadata: JSON block with title, author, date, description, canonical URL, image
- External links: domain list or full URL breakdown by domain
- Injection details: pattern name, severity, and context snippet for each match (only present when patterns detected)
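The salted body wrapper can be sketched in a few lines. This is an illustration rather than the project's code: the function name `wrap_body` and the closing-tag form `</fetch-content-{salt}>` are assumptions. Only the opening tag name and the 8-character hex salt come from this README.

```python
import secrets


def wrap_body(body):
    """Wrap body text in tags that carry a fresh, unpredictable salt."""
    salt = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
    tag = f"fetch-content-{salt}"
    return salt, f"<{tag}>\n{body}\n</{tag}>"


if __name__ == "__main__":
    salt, wrapped = wrap_body("# Example page\n\nSome extracted markdown.")
    print(wrapped)
```

Because the salt is generated after the page has been fetched, text inside the page cannot know it in advance, so a fake closing tag embedded in the content will not match the real boundary.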
MCP Server
Returns a structured dict:
url, fetched_at, body, content_type, metadata, links, links_mode,
risk_level, injection_matches, edge_cases, sanitization,
llms_txt_available, llms_txt_replaced, js_rendered, js_hint,
retried, truncated_at
When --strict is set and the risk level is HIGH, the CLI exits with code 2 and the MCP server raises an error response. The full result is still available in both cases.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Fetch error (network failure, empty response, binary content) |
| 2 | High-risk injection detected (--strict only) |
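A script that shells out to the CLI might interpret these exit codes along the following lines. The helper names are hypothetical; the command name and flags are the ones documented in the CLI section above, and the sketch assumes `fetch-guard-cli` is on `PATH`.

```python
import subprocess

# Meanings taken from the exit code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "fetch error",
    2: "high-risk injection detected (strict mode)",
}


def build_command(url, strict=True, timeout=180):
    """Assemble the fetch-guard-cli argument list."""
    cmd = ["fetch-guard-cli", url, "--timeout", str(timeout)]
    if strict:
        cmd.append("--strict")
    return cmd


def describe_exit(code):
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_fetch_guard(url, strict=True, timeout=180):
    """Run the CLI and return (exit_code, meaning, stdout)."""
    proc = subprocess.run(
        build_command(url, strict, timeout), capture_output=True, text=True
    )
    return proc.returncode, describe_exit(proc.returncode), proc.stdout
```

Since the full result is still printed when `--strict` trips, a caller can log `stdout` for review even when the exit code is 2.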
Architecture
fetch_guard/
├── pipeline.py # Core orchestration — 13-step sequence, shared by CLI and server
├── cli.py # CLI entry point — arg parsing, pipeline call, output
├── server.py # MCP server — FastMCP wrapper over the same pipeline
│
├── http/ # HTTP fetching layer
│ ├── client.py # Static HTTP fetch via requests
│ ├── playwright.py # JS rendering via Playwright (optional)
│ └── llms_txt.py # /llms.txt preflight check
│
├── extraction/ # Content extraction and edge detection
│ ├── content.py # trafilatura wrapper — HTML to markdown
│ ├── content_type.py # Non-HTML routing — JSON, XML/RSS, CSV, plain text
│ ├── edges.py # Bot block, paywall, login wall classification
│ ├── links.py # External link extraction (domain list or full URLs)
│ └── metadata.py # JSON-LD, Open Graph, meta tag extraction
│
├── security/ # Injection defense
│ ├── guard.py # Salt generation, content wrapping, three-phase pattern scanning
│ ├── normalize.py # NFKC + confusable-character normalization for homoglyph detection
│ ├── patterns.py # 14 compiled regex patterns — single source of truth
│ └── sanitizer.py # Hidden element, CSS rule, and non-printing character removal
│
└── output/ # Formatting
└── formatter.py # CLI output assembly
Each module is a single-responsibility unit with a public function as its interface. pipeline.py is the shared core: both cli.py and server.py call pipeline.run() and handle the result in their own way.
Development
# Run tests (239 unit tests, all mocked — no network calls)
pytest
# Run live integration tests (hits real URLs)
pytest -m live
# Lint
ruff check fetch_guard/ tests/
CI runs on push and PR to main via GitHub Actions, testing against Python 3.10, 3.12, and 3.13.
Acknowledgements
Developed with Claude Code.
License