
doc-breach-mcp

A local MCP server that extracts clean Markdown from developer portals using layered heuristics. It works around WAF blocks, reads SPA shells and PDFs, and needs no SaaS dependencies or API keys.


<div align="center"> <h1>💥 DocBreach <sup>(MCP)</sup></h1> <p><b>The web is hostile to AI Agents. We brought a crowbar.</b></p> <br /> <p> <a href="#"><img src="https://img.shields.io/badge/SaaS_Dependencies-Zero-red?style=for-the-badge" /></a> <a href="#"><img src="https://img.shields.io/badge/WAF_Bypass-Automated-success?style=for-the-badge" /></a> <a href="#"><img src="https://img.shields.io/badge/Cost-$0.00_Forever-blue?style=for-the-badge" /></a> <a href="#"><img src="https://img.shields.io/badge/Transport-STDIO-black?style=for-the-badge" /></a> </p> <p> <a href="#-quickstart">Quickstart</a> · <a href="#%EF%B8%8F-the-weapon-guerrilla-architecture">Architecture</a> · <a href="#-the-uncomfortable-truth">Why This Exists</a> </p> <img width="1219" height="677" alt="image" src="https://github.com/user-attachments/assets/6e15e72d-d2ce-4b68-b031-b7b5305ab916" />

</div>

Using DocBreach in production? We want to hear about it. Join the Discord to share WAF horror stories, edge cases, and what you're building. The founder is in there.

🛑 The Problem: Agentic Workflows Are Blind

We're in the era of autonomous AI Agents — but the web was built to repel bots, not serve them.

When your Claude, Cursor, or Windsurf tries to read an obscure API's documentation, it gets annihilated by:

- Cloudflare WAFs throwing 403 CAPTCHAs at Node.js fetch().
- Empty SPA shells (Next.js, Mintlify, GitBook) that render nothing without a $300M headless browser.
- Legacy enterprise PDFs that crash the model's context window.
- Login walls that lock public API references behind OAuth gates.
- "AI-friendly" SaaS tools (Firecrawl, Jina, Context7) charging you $50/mo to read pages that are already public.

The LLM doesn't need a middleman. It needs raw signal.

⚔️ The Weapon: Guerrilla Architecture

DocBreach is a ruthless, 100% local MCP server. It doesn't ask for permission. It uses military-grade heuristics to extract clean, LLM-optimized Markdown from any developer portal — and it does it for free, forever.

| Enemy Defense | DocBreach Tactical Override |
| --- | --- |
| 🛡️ Cloudflare / WAF 403 | Temporal Proxying: Hits a WAF? Silently pivots to the Wayback Machine. The docs from last week work just fine. |
| ⚛️ JavaScript SPA Walls | Hydration Hijacking: Rips `__NEXT_DATA__`, `__NUXT__`, `__GITBOOK_STATE__` straight from the DOM. Zero JS engine needed. |
| 🪟 Hidden iFrames | Source Chasing: Detects embedded Swagger/Postman/Stoplight apps, destroys the wrapper, resolves the true origin URL. |
| 📄 Legacy PDF Manuals | Native Brute-Force: In-memory PDF parsing. Your AI reads 2004 banking manuals like they're GitHub READMEs. |
| 🔐 Login Walls | Wall Detection: Identifies OAuth/SSO gates instantly and tells the agent to pivot to public alternatives. |
| 🕳️ Ghost Town Sites | Self-Healing Errors: No docs found? DocBreach guides the agent to search GitHub repos, SDK source code, or `llms.txt` files. |
| 💸 SaaS Scraping Taxes | Zero. Forever. Everything runs locally via Cheerio and Turndown. No API keys. No accounts. No telemetry. |
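To make the "Temporal Proxying" row concrete, here is a minimal sketch of a fetch that falls back to the Wayback Machine when the live site answers 403 or 503. It is not DocBreach's actual code: the names `Fetcher`, `availabilityUrl`, `closestSnapshot`, and `fetchWithWayback` are hypothetical, and the fetcher is injected so the sketch does not assume a particular runtime. The lookup uses the Internet Archive's public availability endpoint.

```typescript
// Minimal response/fetcher shapes (hypothetical) so the sketch stays runtime-agnostic.
interface MiniResponse {
  status: number;
  text(): Promise<string>;
}
type Fetcher = (url: string) => Promise<MiniResponse>;

// Statuses treated as "blocked"; the pipeline diagram mentions 403 and 503.
const BLOCKED_STATUSES = new Set<number>([403, 503]);

// Build the Wayback Machine availability lookup for a target URL.
function availabilityUrl(target: string): string {
  return "https://archive.org/wayback/available?url=" + encodeURIComponent(target);
}

// Pull the closest snapshot URL out of the availability response, if one exists.
function closestSnapshot(body: string): string | null {
  try {
    const data = JSON.parse(body) as {
      archived_snapshots?: { closest?: { available?: boolean; url?: string } };
    };
    const closest = data.archived_snapshots?.closest;
    if (closest && closest.available && typeof closest.url === "string") {
      return closest.url;
    }
    return null;
  } catch {
    return null; // not JSON: treat as "no snapshot"
  }
}

// Try the live URL first; on a blocked status, retry against the archived copy.
async function fetchWithWayback(
  url: string,
  fetcher: Fetcher,
): Promise<{ body: string; source: "live" | "wayback" }> {
  const live = await fetcher(url);
  if (!BLOCKED_STATUSES.has(live.status)) {
    return { body: await live.text(), source: "live" };
  }
  const lookup = await fetcher(availabilityUrl(url));
  const snapshot = closestSnapshot(await lookup.text());
  if (snapshot === null) {
    throw new Error(`Blocked (${live.status}) and no archived snapshot for ${url}`);
  }
  const archived = await fetcher(snapshot);
  return { body: await archived.text(), source: "wayback" };
}
```

In Node 18+ the global `fetch` can be passed as the fetcher. The trade-off is freshness: an archived snapshot may lag behind the live docs, so returning `source` lets the caller tell the agent which copy it is reading.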

"The LLM shouldn't be smart at scraping. It should be smart at coding. DocBreach handles the dirty work."

🚀 Quickstart

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "docbreach": {
      "command": "npx",
      "args": ["-y", "doc-breach-mcp"]
    }
  }
}

Cursor / Windsurf

Add to your MCP settings:

{
  "doc-breach": {
    "command": "npx",
    "args": ["-y", "doc-breach-mcp"]
  }
}

That's it. No API keys. No .env files. No sign-ups. It just works.

🧠 How It Thinks

DocBreach gives your AI agent 5 precision tools and lets the model drive:

You: "Integrate with the Datadog API and list all monitors"

Agent → docs.discover({ query: "datadog API" })
     ← Found: docs.datadoghq.com/api/latest/ (openapi)

Agent → docs.map({ domain: "docs.datadoghq.com" })
     ← 🗺️ Sitemap hierarchy, auto-generated Mermaid graph, and llms.txt discovery

Agent → docs.read({ url: "https://docs.datadoghq.com/api/latest/" })
     ← 📄 Clean Markdown + nav links + auth requirements

Agent → docs.extract({ url: "https://api.datadoghq.com/api/v2/openapi.yaml", tag: "monitors" })
     ← 📋 GET /api/v1/monitor — List all monitors
        GET /api/v1/monitor/{id} — Get a monitor's details
        POST /api/v1/monitor — Create a monitor
        ...

Agent: "I see the API requires DD-API-KEY and DD-APPLICATION-KEY headers,
        and you need to select a DD_SITE (US1, EU, US3, US5, AP1)..."

The model reasons. DocBreach retrieves. Nobody hallucinates.

The 11-Step Reader Pipeline

Every URL passes through a battle-hardened, 11-step extraction pipeline:

 URL
  │
  ├─ 1. Preflight ──────── HEAD check → Content-Type, size, reject >10MB
  ├─ 2. Fetch ──────────── GET + Wayback Machine fallback on 403/503
  ├─ 3. Login Detection ── OAuth/SSO wall? → abort + guide agent
  ├─ 4. Format Detection ─ OpenAPI? Postman? PDF? llms.txt? Markdown?
  ├─ 5. Binary Handling ── PDF <5MB → in-memory parse
  ├─ 6. Spec Summary ───── OpenAPI/Postman → structured Markdown
  ├─ 7. SPA Hydration ──── __NEXT_DATA__, __NUXT__, readme-data, GitBook
  ├─ 8. Nav Extraction ─── Sidebar links → absolute URLs
  ├─ 9. iFrame Intel ───── Swagger/Postman/Stoplight embed → true URL
  ├─ 10. HTML Cleaning ─── Cheerio → remove headers, footers, ads, nav
  └─ 11. Markdown ──────── Turndown + boundary-aware truncation
  │
  ▼
 Clean, LLM-ready Markdown
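Step 7 of the pipeline above can be illustrated with a small sketch that reads the JSON payload Next.js embeds in the page, without running any JavaScript. This is an assumption-laden illustration, not DocBreach's implementation: the README says the project uses Cheerio, while this sketch swaps in a plain regular expression to stay dependency-free. The function names `extractNextData` and `collectStrings` are hypothetical, and collecting long string leaves is only a naive stand-in for real content selection.

```typescript
// Extract the JSON payload that Next.js embeds in <script id="__NEXT_DATA__">.
// Returns null when the tag is missing or the payload is not valid JSON.
function extractNextData(html: string): unknown {
  const pattern = /<script[^>]*\bid=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i;
  const match = html.match(pattern);
  if (!match) return null;
  try {
    return JSON.parse(match[1] ?? "");
  } catch {
    return null;
  }
}

// Walk the payload and gather string leaves that are long enough to be prose.
// Naive heuristic (assumption): real documentation payloads need smarter selection.
function collectStrings(value: unknown, minLength = 40, out: string[] = []): string[] {
  if (typeof value === "string") {
    if (value.length >= minLength) out.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) collectStrings(item, minLength, out);
  } else if (value !== null && typeof value === "object") {
    for (const item of Object.values(value as Record<string, unknown>)) {
      collectStrings(item, minLength, out);
    }
  }
  return out;
}
```

Because the hydration payload is already in the HTML response, this approach needs one HTTP request and no headless browser. It only works for frameworks that serialize their state into the page.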

🪖 The Uncomfortable Truth

The developer tooling market has a parasite problem.

Companies like Firecrawl, Jina Reader, and Context7 take public documentation — pages that are freely accessible to any browser — wrap them in a proprietary API, and charge you a monthly subscription to access what was already yours.

They aren't adding value. They're adding a toll booth to the public internet.

DocBreach exists because:

- Documentation is public. If a human can read it, an agent should too.
- Scraping is a solved problem. Cheerio + Turndown have existed for a decade. You don't need a $20M startup to parse HTML.
- Your AI runs locally. Why should it phone home to a SaaS to read a README?

This is not a product. This is a crowbar.

📊 DocBreach vs. The Toll Booths

| Feature | DocBreach | Firecrawl | Jina Reader | Context7 |
| --- | --- | --- | --- | --- |
| Cost | $0 | $50+/mo | $30+/mo | Free (limited) |
| Runs locally | ✅ | ❌ Cloud | ❌ Cloud | ❌ Cloud |
| No API keys | ✅ | ❌ | ❌ | ❌ |
| No telemetry | ✅ | ❌ | ❌ | ❌ |
| WAF bypass | ✅ Wayback | ✅ Paid proxy | ❌ | ❌ |
| SPA extraction | ✅ Hydration | ✅ Headless | ❌ | ❌ |
| PDF parsing | ✅ Native | ✅ | ❌ | ❌ |
| OpenAPI extraction | ✅ | ❌ | ❌ | ❌ |
| HATEOAS navigation | ✅ | ❌ | ❌ | ❌ |
| Cognitive rules | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ MIT | Partial | ❌ | ✅ |

🔧 Tools Reference

`docs.discover`

Find documentation sources for any service, library, or API.

docs.discover({ query: "stripe webhooks API" })
// → [ { url, title, type: "openapi", source: "probe" }, ... ]

`docs.map`

Map the complete documentation structure of any domain. Extracts sitemaps, robots.txt, and llms.txt, returning an architectural blueprint.

docs.map({ domain: "docs.stripe.com" })
// → { total: 1200, sections: { "Root": [...], "API": [...] }, ... }
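As a rough illustration of how a sitemap could be turned into the `total` / `sections` shape shown above, the sketch below groups `<loc>` entries by their first path segment. The grouping rule, the `"Root"` bucket for top-level pages, and the names `DocsMap`, `pathnameOf`, and `mapSitemap` are assumptions made for this example. The sketch does not handle sitemap index files or XML entities.

```typescript
// Hypothetical result shape, modeled on the sample output above.
interface DocsMap {
  total: number;
  sections: Record<string, string[]>;
}

// Return the pathname of a URL, or null when the string is not a valid URL.
function pathnameOf(loc: string): string | null {
  try {
    return new URL(loc).pathname;
  } catch {
    return null;
  }
}

// Group every <loc> entry of a sitemap by its first path segment.
function mapSitemap(xml: string): DocsMap {
  const sections: Record<string, string[]> = {};
  let total = 0;
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const loc = match[1] ?? "";
    const pathname = pathnameOf(loc);
    if (pathname === null) continue; // skip malformed entries
    const segments = pathname.split("/").filter((s) => s.length > 0);
    const key: string = segments[0] ?? "Root"; // top-level pages go under "Root"
    const bucket = sections[key] ?? [];
    bucket.push(loc);
    sections[key] = bucket;
    total += 1;
  }
  return { total, sections };
}
```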

`docs.read`

Read any documentation URL and return clean, LLM-ready Markdown.

docs.read({ url: "https://docs.stripe.com/webhooks" })
// → { content: "# Webhooks\n\n...", nav_links: [...], format: "html" }
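The `nav_links` field implies that sidebar links are resolved against the page URL (step 8 of the reader pipeline). A minimal sketch of that resolution step, assuming the raw `href` values have already been pulled from the sidebar, might look like this. `toAbsolute` and `resolveNavLinks` are hypothetical names, and skipping fragment-only and non-HTTP links is a choice made for this example.

```typescript
// Resolve an href against a base URL, or return null if it cannot be parsed.
function toAbsolute(href: string, base: string): URL | null {
  try {
    return new URL(href, base);
  } catch {
    return null;
  }
}

// Turn sidebar hrefs into de-duplicated absolute URLs.
function resolveNavLinks(hrefs: string[], pageUrl: string): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const href of hrefs) {
    // Same-page anchors and non-navigational schemes are not useful to an agent.
    if (href.startsWith("#") || /^(mailto|javascript|tel):/i.test(href)) continue;
    const absolute = toAbsolute(href, pageUrl);
    if (absolute === null) continue;
    absolute.hash = ""; // "/page#a" and "/page#b" point at the same document
    const key = absolute.toString();
    if (!seen.has(key)) {
      seen.add(key);
      result.push(key);
    }
  }
  return result;
}
```

Returning absolute URLs means the agent can pass any link straight back to `docs.read` without knowing which page it came from.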

`docs.search`

Search for specific topics within a documentation site.

docs.search({ query: "authentication", site: "docs.stripe.com" })
// → [ { url: ".../authentication", title: "Authentication", ... } ]

`docs.extract`

Extract structured endpoint information from OpenAPI/Swagger/Postman specs.

docs.extract({ url: "https://api.stripe.com/openapi/spec.json", tag: "charges" })
// → [ { method: "POST", path: "/v1/charges", summary: "Create a charge" }, ... ]
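For readers curious what tag filtering over an OpenAPI document involves, here is a small sketch that walks the `paths` object and keeps operations carrying a given tag. It assumes the spec has already been parsed into an object, and it matches tags case-insensitively, which is an assumption rather than documented DocBreach behavior. `EndpointSummary`, `OpenApiDoc`, and `extractEndpoints` are hypothetical names.

```typescript
// Output row, modeled on the sample result above.
interface EndpointSummary {
  method: string;
  path: string;
  summary: string;
}

// Only the parts of an OpenAPI document this sketch reads.
interface OpenApiOperation {
  tags?: string[];
  summary?: string;
}
interface OpenApiDoc {
  paths?: Record<string, Record<string, unknown>>;
}

// Path items also hold non-operation keys (e.g. "parameters"), so only
// the HTTP method keys defined by the OpenAPI specification are inspected.
const HTTP_METHODS = ["get", "put", "post", "delete", "options", "head", "patch", "trace"];

// List operations, optionally restricted to one tag (case-insensitive).
function extractEndpoints(spec: OpenApiDoc, tag?: string): EndpointSummary[] {
  const wanted = tag === undefined ? undefined : tag.toLowerCase();
  const results: EndpointSummary[] = [];
  for (const [path, item] of Object.entries(spec.paths ?? {})) {
    for (const method of HTTP_METHODS) {
      const op = item[method] as OpenApiOperation | undefined;
      if (op === undefined || op === null || typeof op !== "object") continue;
      const tags = (op.tags ?? []).map((t) => t.toLowerCase());
      if (wanted !== undefined && !tags.includes(wanted)) continue;
      results.push({ method: method.toUpperCase(), path, summary: op.summary ?? "" });
    }
  }
  return results;
}
```

Returning only method, path, and summary keeps the response small, which matters more to an LLM's context window than completeness does.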

🏆 Beyond the MCP Specification

Google and Anthropic's official MCP best practices ask for "Single Responsibility," "Clear Descriptions," and "Structured Error Handling." That is the bare minimum.

Built on Vurb.ts, DocBreach goes well past that baseline:

- MVA Architecture (Model → View → Agent): Standard MCP returns raw JSON strings. We route everything through Fluent Presenters acting as smart egress firewalls, stripping noise before the LLM ever sees it.
- HATEOAS Navigation: Instead of the agent guessing what to do next, every DocBreach response includes a .suggestActions() payload telling the model exactly which tool to call next.
- JIT System Rules: Dynamic instructions injected mid-flight based on payload context (e.g., "The content was truncated, use search").
- Self-Healing Errors: Standard MCP throws an error. DocBreach returns an error and the exact prompt/tool required to recover from it.
- Server-Side Mermaid UI: Sends native ui.mermaid() visual graphs to the MCP Inspector to help humans see the architecture the agent sees.
- State Sync & Cache Control: Emits .cached() directives at the protocol level to eliminate duplicate requests and save LLM token context.
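The ideas in the list above (suggested next actions, just-in-time rules, and errors that carry their own recovery path) can be sketched with plain objects. This does not use or reproduce the Vurb.ts API, which is not shown in this README. Every name here (`ToolResponse`, `SuggestedAction`, `truncateAtBoundary`, `readResponse`, `loginWallError`) is hypothetical, and the paragraph-break cut is only one possible reading of "boundary-aware truncation".

```typescript
// Hypothetical response envelope: content plus guidance for the agent.
interface SuggestedAction {
  tool: string;
  args: Record<string, string>;
  reason: string;
}
interface ToolResponse {
  ok: boolean;
  content: string;
  rules: string[]; // just-in-time instructions for the model
  suggestedActions: SuggestedAction[]; // which tool to call next
}

// Cut at the last paragraph break before the limit so Markdown blocks stay whole.
function truncateAtBoundary(markdown: string, limit: number): { text: string; truncated: boolean } {
  if (markdown.length <= limit) return { text: markdown, truncated: false };
  const cut = markdown.lastIndexOf("\n\n", limit);
  const end = cut > 0 ? cut : limit; // no paragraph break found: hard cut
  return { text: markdown.slice(0, end), truncated: true };
}

// Successful read: attach a rule and a next step only when content was cut.
function readResponse(url: string, markdown: string, limit: number): ToolResponse {
  const { text, truncated } = truncateAtBoundary(markdown, limit);
  const site = new URL(url).hostname;
  return {
    ok: true,
    content: text,
    rules: truncated ? ["The content was truncated, use search."] : [],
    suggestedActions: truncated
      ? [{ tool: "docs.search", args: { query: "<topic>", site }, reason: "Find the section that was cut off." }]
      : [],
  };
}

// Failed read: the error names the tool that can recover from it.
function loginWallError(url: string): ToolResponse {
  return {
    ok: false,
    content: `Login wall detected at ${url}.`,
    rules: ["Do not retry this URL; look for a public alternative."],
    suggestedActions: [
      {
        tool: "docs.discover",
        args: { query: new URL(url).hostname },
        reason: "Search for public docs, SDK source, or llms.txt.",
      },
    ],
  };
}
```

The point of the envelope is that the agent never has to infer a recovery strategy from a bare error string: the next tool call is part of the response.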

📄 License

MIT — because documentation should be free, and so should the tools that read it.

<div align="center"> <br /> <p><b>Stop paying rent to read public web pages.</b></p> <p>⭐ Star this repo if you agree.</p> </div>
