PDF RAG MCP Server
Enables RAG over messy PDFs — extract, chunk, embed, and search scanned, multi-column, and table-heavy documents.
<p align="center"> <img src="docs/images/inspector-tools.png" alt="MCP Inspector — Tools" width="800"/> <br/> <em>All 6 tools running in the MCP Inspector</em> </p>
What is RAG?
RAG (Retrieval-Augmented Generation) grounds an AI assistant's answers in your own documents. Instead of relying only on its training data, the assistant first retrieves relevant chunks from your files, then uses them as context when generating an answer.
Traditional AI: User Question → LLM → Answer (may hallucinate)
RAG: User Question → Search Your Docs → LLM + Context → Accurate Answer
This MCP server is the "Search Your Docs" part: it ingests PDFs, breaks them into searchable chunks, and lets any MCP-compatible AI assistant retrieve the most relevant passages for a query.
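To make the "LLM + Context" step concrete, here is a minimal sketch of how a client might turn retrieved chunks into a grounded prompt. The function name `build_prompt` and the prompt wording are illustrative assumptions, not part of this server; the chunk fields mirror the `pdf_search` results shown in the Tool Examples section.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble an LLM prompt that grounds the answer in retrieved chunks.

    Hypothetical helper: each chunk is assumed to carry the keys
    "text", "page_num" and "source_filename".
    """
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Label every chunk with its source so the model can cite it.
        context_lines.append(
            f"[{i}] ({chunk['source_filename']}, page {chunk['page_num']}) "
            f"{chunk['text']}"
        )
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The server itself only supplies the chunks; prompt assembly happens in whatever assistant consumes them.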
Why This Server?
Most PDF tools choke on real-world documents — scanned pages, multi-column layouts, embedded tables. This MCP server handles them all:
- Scanned PDFs — Automatic OCR via Tesseract when text extraction fails
- Multi-column layouts — Layout-preserving block sorting with PyMuPDF
- Tables — Detected and extracted as clean markdown via pdfplumber
- Semantic search — Find information by meaning, not just keywords
- 100% local — Embeddings run on your machine. No data leaves your system.
Demo
Ingest a PDF and search it
<p align="center"> <img src="docs/images/inspector-ingest.png" alt="PDF Ingest" width="800"/> <br/> <em>Ingesting a PDF — extracts text, chunks it, generates embeddings</em> </p>
<p align="center"> <img src="docs/images/inspector-search.png" alt="Semantic Search" width="800"/> <br/> <em>Semantic search returns ranked results with page numbers and similarity scores</em> </p>
Tool Examples
1. Ingest a PDF
Tool: pdf_ingest
Input: { "file_path": "/home/user/reports/ai-healthcare-2026.pdf" }
{
"doc_id": "a1b2c3d4e5f6",
"filename": "ai-healthcare-2026.pdf",
"total_pages": 4,
"total_chunks": 12,
"scanned_pages": [],
"status": "ingested"
}
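MCP clients normally build tool invocations for you, but it can help to see the wire format. In the MCP protocol a tool invocation is a JSON-RPC 2.0 request with method `tools/call`. The sketch below serializes one for `pdf_ingest`; the helper name `make_tool_call` is an assumption for illustration.

```python
import json


def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Same input as the pdf_ingest example above.
request = make_tool_call(
    1, "pdf_ingest", {"file_path": "/home/user/reports/ai-healthcare-2026.pdf"}
)
```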
2. Search across documents
Tool: pdf_search
Input: { "query": "drug discovery timelines", "limit": 3 }
{
"query": "drug discovery timelines",
"total_results": 2,
"results": [
{
"text": "Drug discovery timelines shortened by 30% using generative AI models...",
"score": 0.8742,
"page_num": 1,
"source_filename": "ai-healthcare-2026.pdf"
},
{
"text": "The healthcare AI market is experiencing rapid growth... Drug Discovery 8.7B...",
"score": 0.6521,
"page_num": 3,
"source_filename": "ai-healthcare-2026.pdf"
}
]
}
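A client will often want to drop weak matches and show compact citations. The following sketch assumes the response shape shown above; the helper name `format_hits` and the 0.5 default threshold are illustrative choices, not values used by the server.

```python
import json


def format_hits(response_json: str, min_score: float = 0.5) -> list[str]:
    """Turn a pdf_search response into 'file p.N (score)' citation lines."""
    response = json.loads(response_json)
    lines = []
    for hit in response["results"]:
        if hit["score"] < min_score:
            continue  # skip matches below the chosen similarity threshold
        lines.append(
            f"{hit['source_filename']} p.{hit['page_num']} ({hit['score']:.2f})"
        )
    return lines
```

A suitable threshold depends on the embedding model and the documents, so it is worth tuning against real queries.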
3. Extract tables as markdown
Tool: pdf_extract_tables
Input: { "file_path": "/home/user/reports/ai-healthcare-2026.pdf", "page_num": 3 }
{
"page_num": 3,
"tables_found": 1,
"markdown": "| Application | Market 2025 | Market 2030 |\n|---|---|---|\n| Diagnostic Imaging | $12.4B | $45.2B |\n| Drug Discovery | $8.7B | $32.1B |"
}
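Because the table comes back as a markdown string, downstream code usually needs to parse it into rows. This is a minimal sketch, assuming a simple pipe table with a header row and a separator row as in the output above; `parse_markdown_table` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple markdown pipe table into a list of row dicts."""
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    if len(lines) < 2:
        return []  # need at least a header and a separator row

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split the remaining cells.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator
        rows.append(dict(zip(header, split_row(line))))
    return rows
```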
4. Get a specific page
Tool: pdf_get_page
Input: { "doc_id": "a1b2c3d4e5f6", "page_num": 1 }
{
"doc_id": "a1b2c3d4e5f6",
"page_num": 1,
"text": "Artificial Intelligence in Healthcare\nA Comprehensive Report - 2026\n\nExecutive Summary\nArtificial intelligence is transforming healthcare delivery..."
}
5. List & manage documents
Tool: pdf_list_documents
{
"total_documents": 2,
"documents": [
{ "doc_id": "a1b2c3d4e5f6", "source_filename": "ai-healthcare-2026.pdf", "total_chunks": 12, "total_pages": 4 },
{ "doc_id": "f6e5d4c3b2a1", "source_filename": "quarterly-report.pdf", "total_chunks": 45, "total_pages": 18 }
]
}
Tool: pdf_delete
Input: { "doc_id": "f6e5d4c3b2a1" }
→ { "doc_id": "f6e5d4c3b2a1", "chunks_deleted": 45, "status": "deleted" }
MCP Tools
| Tool | Description |
|---|---|
| pdf_ingest | Ingest a PDF: extract text (with OCR fallback), chunk, embed, and store |
| pdf_search | Semantic search across all ingested PDFs with similarity scores |
| pdf_get_page | Get full extracted text for a specific page |
| pdf_list_documents | List all ingested documents with metadata |
| pdf_delete | Remove a document and its embeddings from the store |
| pdf_extract_tables | Extract tables from a page as markdown |
Installation
Prerequisites
- Python 3.12+
- Tesseract OCR (for scanned PDF support)
# Ubuntu/Debian
sudo apt install tesseract-ocr
# macOS
brew install tesseract
Install from source
git clone https://github.com/MBaranekTech/pdf-rag-mcp.git
cd pdf-rag-mcp
uv venv .venv && source .venv/bin/activate
uv pip install -e .
Install from PyPI
pip install pdf-rag-mcp
Configuration
Claude Desktop
Add to ~/.config/Claude/claude_desktop_config.json (Linux) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):
{
"mcpServers": {
"pdf-rag": {
"command": "/path/to/pdf-rag-mcp/.venv/bin/pdf-rag-mcp"
}
}
}
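If the config file already lists other servers, editing it by hand risks clobbering them. The sketch below merges a single entry under `mcpServers` and leaves the rest untouched. The helper name `add_server` is an assumption; the file layout is the one shown above.

```python
import json
from pathlib import Path


def add_server(config_path: Path, name: str, command: str) -> dict:
    """Add or update one entry under "mcpServers", keeping other entries."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():  # tolerate an empty file
            config = json.loads(text)
    config.setdefault("mcpServers", {})[name] = {"command": command}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

On Linux this could be called as `add_server(Path.home() / ".config/Claude/claude_desktop_config.json", "pdf-rag", "/path/to/pdf-rag-mcp/.venv/bin/pdf-rag-mcp")`; restart Claude Desktop afterwards so it rereads the file.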
Claude Code
claude mcp add pdf-rag -- /path/to/pdf-rag-mcp/.venv/bin/pdf-rag-mcp
Cursor / VS Code
Add to .cursor/mcp.json or VS Code MCP settings:
{
"mcpServers": {
"pdf-rag": {
"command": "/path/to/pdf-rag-mcp/.venv/bin/pdf-rag-mcp"
}
}
}
Docker
docker build -t pdf-rag-mcp .
docker run -v /path/to/pdfs:/pdfs pdf-rag-mcp
Architecture
PDF File
│
▼
┌─────────────────────────────────────────┐
│ pdf_extractor.py │
│ ┌───────────┐ ┌──────────────────┐ │
│ │ PyMuPDF │──▶│ Text extraction │ │
│ └───────────┘ │ (layout-aware) │ │
│ ┌───────────┐ ├──────────────────┤ │
│ │ Tesseract │──▶│ OCR fallback │ │
│ └───────────┘ │ (scanned pages) │ │
│ ┌───────────┐ ├──────────────────┤ │
│ │pdfplumber │──▶│ Table detection │ │
│ └───────────┘ └──────────────────┘ │
└──────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ chunker.py │
│ Split into ~500-word overlapping │
│ chunks with page number metadata │
└──────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ vector_store.py │
│ ┌────────────────────┐ ┌───────────┐ │
│ │ sentence-transformers│ │ ChromaDB │ │
│ │ (all-MiniLM-L6-v2) │─▶│ (cosine) │ │
│ └────────────────────┘ └───────────┘ │
└──────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ server.py (FastMCP) │
│ 6 tools exposed via MCP protocol │
└─────────────────────────────────────────┘
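The "(cosine)" label in the vector store stage refers to ranking chunks by cosine similarity between the query embedding and each stored embedding. The server delegates this to ChromaDB; the brute-force sketch below only illustrates the metric, using tiny hand-written vectors in place of real embeddings. The names `cosine_similarity` and `top_k` are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 3):
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in stored.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]
```

A real vector store avoids scanning every vector by using an index, but the ranking criterion is the same.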
How It Works
- Ingest: PyMuPDF extracts text blocks sorted by position. Pages with fewer than 50 characters of text are automatically OCR'd via Tesseract.
- Chunk: Text is split into ~500-word overlapping chunks (50-word overlap), preserving page number metadata.
- Embed: Chunks are embedded using all-MiniLM-L6-v2 (~80 MB, runs locally, no API keys).
- Store: Embeddings and metadata are persisted in ChromaDB at ~/.pdf-rag-mcp/chroma_db/.
- Search: Queries are embedded and matched against stored chunks using cosine similarity.
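The OCR decision and the chunking step can be sketched in a few lines. This is an approximation under stated assumptions, not the code in pdf_extractor.py or chunker.py: the function names are hypothetical, whitespace is stripped before counting characters, and chunks are built per page so each one carries a single page number. The thresholds (50 characters, 500 words, 50-word overlap) come from the description above.

```python
def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Treat a page as scanned when extraction yields too little text."""
    return len(page_text.strip()) < min_chars


def chunk_page(
    text: str, page_num: int, chunk_size: int = 500, overlap: int = 50
) -> list[dict]:
    """Split one page into overlapping word windows tagged with page_num."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # each window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({"page_num": page_num, "text": " ".join(window)})
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the page
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.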
Development
# Setup
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Test with MCP Inspector (requires Node.js)
fastmcp dev inspector src/pdf_rag_mcp/server.py:mcp --with-editable .
# Opens browser UI at http://localhost:6274
Tech Stack
| Component | Technology |
|---|---|
| MCP Framework | FastMCP |
| PDF Extraction | PyMuPDF |
| Table Extraction | pdfplumber |
| OCR | Tesseract via pytesseract |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector Store | ChromaDB |
Privacy
All processing happens locally:
- Embedding model runs on your machine
- PDF content is never sent to external APIs
- Data is stored at ~/.pdf-rag-mcp/chroma_db/
License