doc-breach-mcp

A local MCP server that extracts clean Markdown from any developer portal using military-grade heuristics, bypassing WAFs, SPAs, and PDFs with zero SaaS dependencies or API keys.

<div align="center">
<h1>💥 DocBreach <sup>(MCP)</sup></h1>
<p><b>The web is hostile to AI Agents. We brought a crowbar.</b></p>
<br />
<p>
<a href="#"><img src="https://img.shields.io/badge/SaaS_Dependencies-Zero-red?style=for-the-badge" /></a>
<a href="#"><img src="https://img.shields.io/badge/WAF_Bypass-Automated-success?style=for-the-badge" /></a>
<a href="#"><img src="https://img.shields.io/badge/Cost-$0.00_Forever-blue?style=for-the-badge" /></a>
<a href="#"><img src="https://img.shields.io/badge/Transport-STDIO-black?style=for-the-badge" /></a>
</p>
<p>
<a href="#-quickstart">Quickstart</a> · <a href="#%EF%B8%8F-the-weapon-guerrilla-architecture">Architecture</a> · <a href="#-the-uncomfortable-truth">Why This Exists</a>
</p>
<img width="1219" height="677" alt="image" src="https://github.com/user-attachments/assets/6e15e72d-d2ce-4b68-b031-b7b5305ab916" />

</div>

Using DocBreach in production? We want to hear about it. Join the Discord for WAF horror stories, edge cases, and what you're building. The founder is in there.


🛑 The Problem: Agentic Workflows Are Blind

We're in the era of autonomous AI agents, but the web was built to repel bots, not serve them.

When your Claude, Cursor, or Windsurf tries to read an obscure API's documentation, it gets annihilated by:

  1. Cloudflare WAFs throwing 403 CAPTCHAs at Node.js fetch().
  2. Empty SPA shells (Next.js, Mintlify, GitBook) that render nothing without a $300M headless browser.
  3. Legacy enterprise PDFs that overflow the model's context window.
  4. Login walls that lock public API references behind OAuth gates.
  5. "AI-friendly" SaaS tools (Firecrawl, Jina, Context7) charging you $50/mo to read pages that are already public.

The LLM doesn't need a middleman. It needs raw signal.


⚔️ The Weapon: Guerrilla Architecture

DocBreach is a ruthless, 100% local MCP server. It doesn't ask for permission. It layers fallback heuristics (archive lookups, hydration-state extraction, in-memory PDF parsing) to extract clean, LLM-optimized Markdown from developer portals, and it does it for free, forever.

| Enemy Defense | DocBreach Tactical Override |
| --- | --- |
| 🛡️ Cloudflare / WAF 403 | Temporal Proxying: Hits a WAF? Silently pivots to the Wayback Machine. The docs from last week work just fine. |
| ⚛️ JavaScript SPA Walls | Hydration Hijacking: Rips `__NEXT_DATA__`, `__NUXT__`, `__GITBOOK_STATE__` straight from the DOM. Zero JS engine needed. |
| 🪟 Hidden iFrames | Source Chasing: Detects embedded Swagger/Postman/Stoplight apps, destroys the wrapper, resolves the true origin URL. |
| 📄 Legacy PDF Manuals | Native Brute-Force: In-memory PDF parsing. Your AI reads 2004 banking manuals like they're GitHub READMEs. |
| 🔐 Login Walls | Wall Detection: Identifies OAuth/SSO gates instantly and tells the agent to pivot to public alternatives. |
| 🕳️ Ghost Town Sites | Self-Healing Errors: No docs found? DocBreach guides the agent to search GitHub repos, SDK source code, or llms.txt files. |
| 💸 SaaS Scraping Taxes | Zero. Forever. Everything runs locally via Cheerio and Turndown. No API keys. No accounts. No telemetry. |
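
To make the hydration idea from the table concrete, here is a minimal sketch that pulls the JSON payload out of a Next.js `__NEXT_DATA__` script tag. It is not DocBreach's actual code: the helper name `extractNextData` and the sample HTML are hypothetical, and a regex stands in for whatever DOM parsing the server really uses (the README mentions Cheerio).

```typescript
// Hypothetical helper: extract the Next.js hydration payload from raw HTML.
// Next.js serializes page props into <script id="__NEXT_DATA__" type="application/json">.
function extractNextData(html: string): unknown {
  const match = html.match(
    /<script[^>]*id=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null; // not a Next.js page, or no hydration payload
  try {
    return JSON.parse(match[1] ?? "");
  } catch {
    return null; // payload present but not valid JSON
  }
}

// Sample page (made up): an SPA shell whose visible body is empty.
const sampleHtml = `
<html><body><div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"title":"Webhooks","body":"Listen for events."}}}
</script></body></html>`;

const payload = extractNextData(sampleHtml) as {
  props: { pageProps: { title: string; body: string } };
} | null;
console.log(payload?.props.pageProps.title); // prints "Webhooks"
```

The content is already in the page source, so no JavaScript engine is needed to read it. Other frameworks expose different globals, and each would need its own extractor.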

"The LLM shouldn't be smart at scraping. It should be smart at coding. DocBreach handles the dirty work."


🚀 Quickstart

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "docbreach": {
      "command": "npx",
      "args": ["-y", "doc-breach-mcp"]
    }
  }
}

Cursor / Windsurf

Add this entry under `mcpServers` in your MCP settings:

{
  "doc-breach": {
    "command": "npx",
    "args": ["-y", "doc-breach-mcp"]
  }
}

That's it. No API keys. No .env files. No sign-ups. It just works.


🧠 How It Thinks

DocBreach gives your AI agent five precision tools and lets the model drive. A typical session uses four of them:

You: "Integrate with the Datadog API and list all monitors"

Agent → docs.discover({ query: "datadog API" })
     ← Found: docs.datadoghq.com/api/latest/ (openapi)

Agent → docs.map({ domain: "docs.datadoghq.com" })
     ← 🗺️ Sitemap hierarchy, auto-generated Mermaid graph, and llms.txt discovery

Agent → docs.read({ url: "https://docs.datadoghq.com/api/latest/" })
     ← 📄 Clean Markdown + nav links + auth requirements

Agent → docs.extract({ url: "https://api.datadoghq.com/api/v2/openapi.yaml", tag: "monitors" })
     ← 📋 GET /api/v1/monitor - List all monitors
        GET /api/v1/monitor/{id} - Get a monitor's details
        POST /api/v1/monitor - Create a monitor
        ...

Agent: "I see the API requires DD-API-KEY and DD-APPLICATION-KEY headers,
        and you need to select a DD_SITE (US1, EU, US3, US5, AP1)..."

The model reasons. DocBreach retrieves. Nobody hallucinates.

The 11-Step Reader Pipeline

Every URL passes through a battle-hardened, 11-step extraction pipeline:

 URL
  │
  ├─ 1. Preflight ──────── HEAD check → Content-Type, size, reject >10MB
  ├─ 2. Fetch ──────────── GET + Wayback Machine fallback on 403/503
  ├─ 3. Login Detection ── OAuth/SSO wall? → abort + guide agent
  ├─ 4. Format Detection ─ OpenAPI? Postman? PDF? llms.txt? Markdown?
  ├─ 5. Binary Handling ── PDF <5MB → in-memory parse
  ├─ 6. Spec Summary ───── OpenAPI/Postman → structured Markdown
  ├─ 7. SPA Hydration ──── __NEXT_DATA__, __NUXT__, readme-data, GitBook
  ├─ 8. Nav Extraction ─── Sidebar links → absolute URLs
  ├─ 9. iFrame Intel ───── Swagger/Postman/Stoplight embed → true URL
  ├─ 10. HTML Cleaning ─── Cheerio → remove headers, footers, ads, nav
  └─ 11. Markdown ──────── Turndown + boundary-aware truncation
  │
  ▼
 Clean, LLM-ready Markdown
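
Steps 1, 2, and 5 of the pipeline come down to a few small decisions, sketched below as pure functions. The size limits and the 403/503 trigger come from the diagram. Everything else is an assumption for illustration: the helper names, the choice to reject PDFs of 5MB or more, and the `web.archive.org/web/<url>` address form (assumed here to resolve to the most recent capture).

```typescript
// Limits taken from the pipeline diagram; the rest is illustrative.
const MAX_BYTES = 10 * 1024 * 1024; // step 1: reject anything over 10MB
const MAX_PDF_BYTES = 5 * 1024 * 1024; // step 5: parse PDFs under 5MB in memory

type PreflightDecision =
  | { action: "fetch" }
  | { action: "parse-pdf" }
  | { action: "reject"; reason: string };

// Hypothetical helper: decide what to do from HEAD response metadata.
// contentLength is null when the server did not send a Content-Length header.
function preflight(contentType: string, contentLength: number | null): PreflightDecision {
  if (contentLength !== null && contentLength > MAX_BYTES) {
    return { action: "reject", reason: "larger than 10MB" };
  }
  if (contentType.toLowerCase().startsWith("application/pdf")) {
    // Assumption: oversized PDFs are rejected; the README does not say.
    if (contentLength !== null && contentLength >= MAX_PDF_BYTES) {
      return { action: "reject", reason: "PDF is 5MB or larger" };
    }
    return { action: "parse-pdf" };
  }
  return { action: "fetch" };
}

// Step 2: the statuses that trigger the archive fallback, per the diagram.
function shouldFallBackToArchive(httpStatus: number): boolean {
  return httpStatus === 403 || httpStatus === 503;
}

// Assumption: this address form resolves to the latest archived capture.
function waybackUrl(targetUrl: string): string {
  return `https://web.archive.org/web/${targetUrl}`;
}

console.log(preflight("text/html; charset=utf-8", 120000).action); // prints "fetch"
console.log(shouldFallBackToArchive(403)); // prints true
console.log(waybackUrl("https://docs.example.com/api"));
```

Keeping these checks free of network calls makes them easy to unit test. The actual HEAD and GET requests would wrap around them.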

🪖 The Uncomfortable Truth

The developer tooling market has a parasite problem.

Companies like Firecrawl, Jina Reader, and Context7 take public documentation (pages that are freely accessible to any browser), wrap them in a proprietary API, and charge you a monthly subscription to access what was already yours.

They aren't adding value. They're adding a toll booth to the public internet.

DocBreach exists because:

  • Documentation is public. If a human can read it, an agent should be able to as well.
  • Scraping is a solved problem. Cheerio + Turndown have existed for a decade. You don't need a $20M startup to parse HTML.
  • Your AI runs locally. Why should it phone home to a SaaS to read a README?

This is not a product. This is a crowbar.


📊 DocBreach vs. The Toll Booths

| Feature | DocBreach | Firecrawl | Jina Reader | Context7 |
| --- | --- | --- | --- | --- |
| Cost | $0 | $50+/mo | $30+/mo | Free (limited) |
| Runs locally | ✅ | ❌ Cloud | ❌ Cloud | ❌ Cloud |
| No API keys | ✅ | ❌ | ❌ | ❌ |
| No telemetry | ✅ | ❌ | ❌ | ❌ |
| WAF bypass | ✅ Wayback | ✅ Paid proxy | ❌ | ❌ |
| SPA extraction | ✅ Hydration | ✅ Headless | ❌ | ❌ |
| PDF parsing | ✅ Native | ✅ | ❌ | ❌ |
| OpenAPI extraction | ✅ | ❌ | ❌ | ❌ |
| HATEOAS navigation | ✅ | ❌ | ❌ | ❌ |
| Cognitive rules | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ MIT | Partial | ❌ | ✅ |

🔧 Tools Reference

docs.discover

Find documentation sources for any service, library, or API.

docs.discover({ query: "stripe webhooks API" })
// → [ { url, title, type: "openapi", source: "probe" }, ... ]

docs.map

Map the complete documentation structure of any domain. Extracts sitemaps, robots.txt, and llms.txt, returning an architectural blueprint.

docs.map({ domain: "docs.stripe.com" })
// → { total: 1200, sections: { "Root": [...], "API": [...] }, ... }
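
One plausible way to produce a `{ total, sections }` shape like the one above is to bucket sitemap URLs by their first path segment. The sketch below is an assumption about the approach, not the server's implementation. `groupBySection` is a hypothetical name, and section keys here are the raw path segments rather than display labels such as "API".

```typescript
// Hypothetical helper: bucket sitemap URLs by their first path segment.
function groupBySection(urls: string[]): { total: number; sections: Record<string, string[]> } {
  const sections: Record<string, string[]> = {};
  for (const raw of urls) {
    const segments = new URL(raw).pathname.split("/").filter(Boolean);
    // Top-level pages go under "Root"; deeper pages use their first segment.
    const key = segments.length > 1 ? (segments[0] ?? "Root") : "Root";
    const bucket = sections[key] ?? [];
    bucket.push(raw);
    sections[key] = bucket;
  }
  return { total: urls.length, sections };
}

// Made-up sitemap entries for illustration.
const grouped = groupBySection([
  "https://docs.example.com/",
  "https://docs.example.com/webhooks",
  "https://docs.example.com/api/charges",
  "https://docs.example.com/api/refunds",
]);
console.log(grouped.total); // prints 4
console.log(Object.keys(grouped.sections)); // prints [ 'Root', 'api' ]
```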

docs.read

Read any documentation URL and return clean, LLM-ready Markdown.

docs.read({ url: "https://docs.stripe.com/webhooks" })
// → { content: "# Webhooks\n\n...", nav_links: [...], format: "html" }
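
The `nav_links` field implies that relative sidebar hrefs get resolved against the page URL. A minimal sketch of that step, using only the standard `URL` class, could look like this. The helper name `resolveNavLinks` and the filtering rules are assumptions made for the example.

```typescript
// Hypothetical helper: turn sidebar hrefs into absolute, de-duplicated URLs.
function resolveNavLinks(pageUrl: string, hrefs: string[]): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    // Skip links that do not navigate to another document.
    if (href.startsWith("#") || href.startsWith("mailto:") || href.startsWith("javascript:")) {
      continue;
    }
    try {
      const absolute = new URL(href, pageUrl); // resolves relative hrefs against the page
      absolute.hash = ""; // fragment variants point at the same document
      seen.add(absolute.toString());
    } catch {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return Array.from(seen);
}

// Made-up sidebar hrefs for illustration.
const navLinks = resolveNavLinks("https://docs.example.com/guide/intro", [
  "../api/",
  "/webhooks",
  "#top",
  "https://other.example.com/x",
  "/webhooks#events",
]);
console.log(navLinks.length); // prints 3
```

Dropping fragments before de-duplicating keeps the agent from reading the same page twice under different anchors.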

docs.search

Search for specific topics within a documentation site.

docs.search({ query: "authentication", site: "docs.stripe.com" })
// → [ { url: ".../authentication", title: "Authentication", ... } ]

docs.extract

Extract structured endpoint information from OpenAPI/Swagger/Postman specs.

docs.extract({ url: "https://api.stripe.com/openapi/spec.json", tag: "charges" })
// → [ { method: "POST", path: "/v1/charges", summary: "Create a charge" }, ... ]
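
Filtering a spec by tag is simple once the document is parsed. The sketch below walks the `paths` object of an OpenAPI document and keeps operations whose `tags` include the requested one. The type names, the `endpointsByTag` helper, and the sample spec are all hypothetical. It assumes the spec is already loaded as JSON and ignores `$ref` resolution.

```typescript
// Minimal slice of an OpenAPI document: paths -> HTTP methods -> operation.
interface Operation {
  summary?: string;
  tags?: string[];
}
type OpenApiDoc = { paths: Record<string, Record<string, Operation>> };

interface EndpointSummary {
  method: string;
  path: string;
  summary: string;
}

const HTTP_METHODS = ["get", "put", "post", "delete", "patch", "head", "options"];

// Hypothetical helper: list the operations carrying a given tag.
function endpointsByTag(spec: OpenApiDoc, tag: string): EndpointSummary[] {
  const wanted = tag.toLowerCase();
  const found: EndpointSummary[] = [];
  for (const [path, item] of Object.entries(spec.paths)) {
    for (const [method, operation] of Object.entries(item)) {
      // Path items may also hold non-operation keys such as "parameters".
      if (!HTTP_METHODS.includes(method)) continue;
      const tags = (operation.tags ?? []).map((t) => t.toLowerCase());
      if (tags.includes(wanted)) {
        found.push({ method: method.toUpperCase(), path, summary: operation.summary ?? "" });
      }
    }
  }
  return found;
}

// Made-up spec fragment for illustration.
const sampleSpec: OpenApiDoc = {
  paths: {
    "/v1/charges": {
      post: { summary: "Create a charge", tags: ["charges"] },
      get: { summary: "List all charges", tags: ["charges"] },
    },
    "/v1/refunds": {
      post: { summary: "Create a refund", tags: ["refunds"] },
    },
  },
};
console.log(endpointsByTag(sampleSpec, "charges").length); // prints 2
```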

🏆 Beyond the MCP Specification

Google and Anthropic's official MCP best practices ask for "Single Responsibility," "Clear Descriptions," and "Structured Error Handling." That is the bare minimum.

Built on Vurb.ts, DocBreach goes well past that baseline:

  • MVA Architecture (Model → View → Agent): Standard MCP returns raw JSON strings. We route everything through Fluent Presenters acting as smart egress firewalls, stripping noise before the LLM ever sees it.
  • HATEOAS Navigation: Instead of the agent guessing what to do next, every DocBreach response includes a .suggestActions() payload telling the model exactly which tool to call next.
  • JIT System Rules: Dynamic instructions injected mid-flight based on payload context (e.g., "The content was truncated, use search").
  • Self-Healing Errors: Standard MCP throws an error. DocBreach returns an error and the exact prompt/tool required to recover from it.
  • Server-Side Mermaid UI: Sends native ui.mermaid() visual graphs to the MCP Inspector to help humans see the architecture the agent sees.
  • State Sync & Cache Control: Emits .cached() directives at the protocol level to eliminate duplicate requests and save LLM token context.
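
To illustrate the HATEOAS, JIT-rule, and self-healing ideas from the list above, here is a plain TypeScript sketch of a response envelope that carries its own next steps. It deliberately avoids the Vurb.ts API, which is not shown in this README. The field names (`rules`, `suggestedActions`), both helper functions, and the plain character-count truncation are assumptions made for the example.

```typescript
// Illustrative response envelope; field names are assumptions, not Vurb.ts types.
interface SuggestedAction {
  tool: string;
  args: Record<string, string>;
  reason: string;
}

interface ToolResponse {
  ok: boolean;
  content: string;
  rules: string[]; // context-dependent instructions for the model
  suggestedActions: SuggestedAction[]; // which tool to call next, and why
}

// Hypothetical: build a docs.read-style response that guides the agent.
function buildReadResponse(url: string, markdown: string, maxChars: number): ToolResponse {
  const truncated = markdown.length > maxChars;
  const site = new URL(url).hostname;
  return {
    ok: true,
    // Plain slice for brevity; a real reader would cut at a content boundary.
    content: truncated ? markdown.slice(0, maxChars) : markdown,
    rules: truncated ? ["The content was truncated, use search"] : [],
    suggestedActions: truncated
      ? [{ tool: "docs.search", args: { site }, reason: "Find the section you need" }]
      : [],
  };
}

// Hypothetical: an error that names the tool call needed to recover.
function loginWallError(url: string): ToolResponse {
  const site = new URL(url).hostname;
  return {
    ok: false,
    content: `Login wall detected at ${site}.`,
    rules: ["Do not retry this URL"],
    suggestedActions: [
      { tool: "docs.discover", args: { query: site }, reason: "Look for a public alternative" },
    ],
  };
}

const longPage = buildReadResponse("https://docs.example.com/webhooks", "x".repeat(50), 10);
console.log(longPage.suggestedActions[0]?.tool); // prints "docs.search"
```

The point of the shape is that the model never has to guess the recovery path. Both the success and the failure case name a concrete tool and arguments.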

📄 License

MIT, because documentation should be free, and so should the tools that read it.


<div align="center"> <br /> <p><b>Stop paying rent to read public web pages.</b></p> <p>⭐ Star this repo if you agree.</p> </div>
