pdf-search-mcp

MCP server for full-text search across PDF document collections with offline indexing, ranked results, snippets, and page rendering.

MCP server for full-text search across PDF document collections. Built for AI agents — index once, search instantly from any MCP client.

  • Search entire collections — pre-indexes all PDFs for instant ranked results with snippets, not one file at a time
  • Fully offline — no API keys, no cloud services, just SQLite FTS5 and PyMuPDF
  • Page rendering — render pages as PNG for formulas, diagrams, and tables; crop to a region with auto-DPI scaling for detail shots
  • Dual renderer — CoreGraphics on macOS (sharper math fonts), PyMuPDF on Linux/Windows
  • German-aware — automatic expansion of ß↔ss, ä↔ae, ö↔oe, ü↔ue so both spellings match

Installation

From PyPI

pip install pdf-search-mcp

From source

git clone https://github.com/renvk/pdf-search-mcp.git
cd pdf-search-mcp
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python 3.10+. On macOS, pyobjc-framework-Quartz is installed automatically for native CoreGraphics PDF rendering (sharper formula and math font output). On Linux/Windows, PyMuPDF is used as the renderer.

Quick Start

1. Index your PDFs

PDF_SEARCH_DIR=/path/to/your/pdfs python -m pdf_search_mcp.pdf_search index

2. Register with your MCP client

The server runs over stdio. Example for Claude Code:

# project-scoped (only available in the current directory)
claude mcp add pdf-search -- pdf-search-mcp

# or user scope (available in all projects; called "global" in older Claude Code releases)
claude mcp add --scope user pdf-search -- pdf-search-mcp

For other MCP clients, add to your MCP config:

{
  "mcpServers": {
    "pdf-search": {
      "command": "pdf-search-mcp"
    }
  }
}

3. Search

Ask your AI agent to search your PDFs — it will use the search, read_page, and read_page_image tools automatically.

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| PDF_SEARCH_DIR | (none) | Path to your PDF directory. Required for the first index run; remembered afterwards |
| PDF_SEARCH_DB | ~/.local/share/pdf-search-mcp/pdf_index.db | Path to the SQLite database file |

CLI Usage

The pdf_search.py module doubles as a CLI for indexing and direct search:

# Build index (first time — PDF_SEARCH_DIR required)
PDF_SEARCH_DIR=/path/to/pdfs python -m pdf_search_mcp.pdf_search index

# Subsequent syncs (path remembered from first index)
python -m pdf_search_mcp.pdf_search index

# Search from command line
python -m pdf_search_mcp.pdf_search search "query terms"

# Read a specific page
python -m pdf_search_mcp.pdf_search read filename.pdf 5

# Show index statistics
python -m pdf_search_mcp.pdf_search stats

# Rebuild index from scratch (path remembered)
python -m pdf_search_mcp.pdf_search reindex

Search Syntax

Uses SQLite FTS5 query syntax:

| Syntax | Example | Description |
| --- | --- | --- |
| Terms | distributed consensus | Both terms must appear (implicit AND) |
| Phrase | "garbage collection" | Exact phrase match |
| OR | mutex OR semaphore | Either term |
| NOT | cache NOT redis | Exclude term |
| Prefix | concur* | Prefix matching |
| NEAR | NEAR(load balancer, 10) | Terms within 10 tokens of each other |
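
To try the query forms in the table without indexing any PDFs, the following sketch runs them against a tiny in-memory FTS5 table using Python's built-in sqlite3 module. The table name `pages`, its columns, and the sample rows are made up for illustration and are not the schema this server uses; it also assumes your Python's SQLite build includes FTS5, which is the common case.

```python
import sqlite3

# Hypothetical schema for illustration only; not pdf-search-mcp's real tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(file UNINDEXED, page UNINDEXED, body)"
)
conn.executemany(
    "INSERT INTO pages (file, page, body) VALUES (?, ?, ?)",
    [
        ("os.pdf", 1, "A mutex guards shared state under concurrent access."),
        ("os.pdf", 2, "A semaphore counts permits; garbage collection is unrelated here."),
        ("dist.pdf", 7, "Distributed systems reach consensus with Raft or Paxos."),
        ("dist.pdf", 8, "A cache such as redis reduces load on the balancer."),
        ("web.pdf", 3, "An in-memory cache speeds up repeated reads."),
    ],
)


def search(query: str) -> list[tuple[str, int]]:
    """Run an FTS5 MATCH query and return (file, page) pairs, best match first."""
    rows = conn.execute(
        "SELECT file, page FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    )
    return [(file, int(page)) for file, page in rows]


if __name__ == "__main__":
    for q in [
        "distributed consensus",      # implicit AND
        '"garbage collection"',       # exact phrase
        "mutex OR semaphore",         # either term
        "cache NOT redis",            # exclude a term
        "concur*",                    # prefix match
        "NEAR(load balancer, 10)",    # proximity
    ]:
        print(q, "->", search(q))
```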

Auto-quoting: Terms containing dots, hyphens, commas, or slashes are automatically quoted (e.g., ISO-27001 becomes "ISO-27001") because FTS5 treats these as token separators.
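
The auto-quoting rule can be pictured with a small sketch like the one below. This is not the project's code: the function name `auto_quote` is hypothetical, and the choice to leave operators, already-quoted phrases, and NEAR groups untouched is an assumption about reasonable behavior.

```python
import re

# A quoted phrase, or a run of non-whitespace characters.
_TOKEN = re.compile(r'"[^"]*"|\S+')
# Characters the README lists as triggering auto-quoting.
_NEEDS_QUOTES = re.compile(r"[.\-,/]")
_OPERATORS = {"AND", "OR", "NOT"}


def auto_quote(query: str) -> str:
    """Wrap bare terms that contain '.', '-', ',' or '/' in double quotes."""
    if "NEAR(" in query:
        # Assumption: NEAR groups use commas as syntax, so leave them alone.
        return query
    out = []
    for token in _TOKEN.findall(query):
        if token.startswith('"') or token in _OPERATORS:
            out.append(token)  # already quoted, or an FTS5 operator
        elif _NEEDS_QUOTES.search(token):
            # FTS5 escapes a double quote inside a string by doubling it.
            out.append('"' + token.replace('"', '""') + '"')
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(auto_quote("ISO-27001 audit"))
```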

German expansion: Umlauts and eszett are automatically expanded to their digraph equivalents and vice versa (ß↔ss, ä↔ae, ö↔oe, ü↔ue). Searching for Größe also finds Groesse, and Weißbuch also finds Weissbuch.
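
One way to picture the expansion is as a rewrite of each term into an OR group of its spellings. The sketch below is an illustration under that assumption, not the server's implementation; `german_variants` and `expand_term` are hypothetical names. The digraph-to-umlaut direction is ambiguous (for example, "neue" would also yield "neü"), which in this sketch only adds a harmless extra OR branch.

```python
_TO_DIGRAPH = {"ß": "ss", "ä": "ae", "ö": "oe", "ü": "ue",
               "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
_TO_UMLAUT = {"ss": "ß", "ae": "ä", "oe": "ö", "ue": "ü",
              "Ae": "Ä", "Oe": "Ö", "Ue": "Ü"}


def _replace_all(term: str, table: dict[str, str]) -> str:
    for src, dst in table.items():
        term = term.replace(src, dst)
    return term


def german_variants(term: str) -> list[str]:
    """Return the term plus its digraph and umlaut spellings, without duplicates."""
    candidates = [term, _replace_all(term, _TO_DIGRAPH), _replace_all(term, _TO_UMLAUT)]
    variants: list[str] = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


def expand_term(term: str) -> str:
    """Build an FTS5 OR group when a term has more than one spelling."""
    variants = german_variants(term)
    if len(variants) == 1:
        return term
    return "(" + " OR ".join(variants) + ")"


if __name__ == "__main__":
    print(expand_term("Größe"))
    print(expand_term("Weissbuch"))
```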

Auto-relaxation: A multi-term query requires all terms to appear on the same page. When such a query returns no results, the search relaxes it automatically: first it drops one term at a time to find the term that is blocking results, and if that still finds nothing, it ORs all terms together. A note in the output states which query was actually run. Queries with explicit operators (AND, OR, NOT, NEAR) are not relaxed.
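
The relaxation order described above might look roughly like this. It is a sketch, not the project's code: `relaxed_search` is a hypothetical name, `run_query` stands in for whatever executes the FTS5 query, and the wording of the note is invented.

```python
from typing import Callable

_OPERATORS = {"AND", "OR", "NOT"}


def relaxed_search(query: str, run_query: Callable[[str], list]) -> tuple[list, str]:
    """Return (results, note); note is empty when the query ran unchanged."""
    results = run_query(query)
    terms = query.split()
    has_operator = any(t in _OPERATORS or t.startswith("NEAR(") for t in terms)
    if results or len(terms) < 2 or has_operator:
        return results, ""

    # Stage 1: drop one term at a time to find the term blocking results.
    for i, dropped in enumerate(terms):
        reduced = " ".join(terms[:i] + terms[i + 1:])
        results = run_query(reduced)
        if results:
            return results, f"No page has all terms; dropped '{dropped}', searched: {reduced}"

    # Stage 2: accept pages that contain any of the terms.
    any_query = " OR ".join(terms)
    return run_query(any_query), f"No page has all terms; searched: {any_query}"
```

Because `run_query` is passed in, the relaxation logic can be exercised with a stub and no database.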

MCP Tools

| Tool | Parameters | Description |
| --- | --- | --- |
| search | query, limit=10 | Full-text search with ranked results and snippets |
| read_page | filename, page, subfolder="" | Read the full text of a specific page |
| read_page_image | filename, page, dpi=140, region=None, subfolder="" | Render a page (or a cropped region) as PNG. Pass region=[x1,y1,x2,y2] with fractional coordinates from 0.0 to 1.0 to crop; DPI auto-scales for the cropped area |
| stats | (none) | Show index statistics (file count, pages, DB size, renderer) |

Python API

from pdf_search_mcp import search_pdfs, read_pdf_page, render_pdf_page, index_pdfs

# Index PDFs
index_pdfs("/path/to/pdfs")

# Search
results = search_pdfs("garbage collection", limit=5)
for r in results:
    print(f"{r['subfolder']}/{r['file']} p.{r['page']}: {r['snippet']}")

# Read full page text
text = read_pdf_page("document.pdf", 42)

# Render full page as PNG
png_path = render_pdf_page("document.pdf", 42)

# Render cropped region (DPI auto-scales to maximize detail)
png_path = render_pdf_page("document.pdf", 42, region=[0.0, 0.5, 1.0, 0.8])

How It Works

  1. Indexing incrementally syncs your PDF directory into a SQLite FTS5 virtual table. On first run, all PDFs are indexed. On subsequent runs, only new, changed (by mtime/size), and deleted files are processed. Subdirectory names are preserved as a subfolder column for context. Directories starting with _ are skipped.

  2. Searching runs FTS5 MATCH queries and re-ranks results by combining BM25 relevance with match density — pages where search terms cluster together score higher than pages with the same terms scattered throughout. The density signal blends term concentration (matches per character) and spatial clustering (how tightly grouped the matches are).

  3. Reading re-opens the original PDF file on disk (path resolved via the stored pdf_dir metadata) for full page text or image rendering. Region crops auto-scale DPI to fill a 1568 px long-edge budget, maximizing detail without producing oversized images.
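
The incremental sync in step 1 amounts to comparing what is on disk with what the index last saw. A minimal sketch of that comparison follows, assuming the index keeps an (mtime, size) pair per relative path; the function names and the dictionary layout are illustrative and not taken from the project.

```python
from pathlib import Path

FileState = tuple[float, int]  # (mtime, size in bytes)


def scan_pdfs(root: Path) -> dict[str, FileState]:
    """Map each PDF's relative path to (mtime, size), skipping '_' directories."""
    found: dict[str, FileState] = {}
    for path in root.rglob("*.pdf"):
        rel = path.relative_to(root)
        if any(part.startswith("_") for part in rel.parts[:-1]):
            continue  # the file sits under a directory starting with "_"
        st = path.stat()
        found[rel.as_posix()] = (st.st_mtime, st.st_size)
    return found


def plan_sync(
    indexed: dict[str, FileState], on_disk: dict[str, FileState]
) -> tuple[list[str], list[str], list[str]]:
    """Return (new, changed, deleted) relative paths."""
    new = sorted(on_disk.keys() - indexed.keys())
    deleted = sorted(indexed.keys() - on_disk.keys())
    changed = sorted(
        path for path in on_disk.keys() & indexed.keys()
        if on_disk[path] != indexed[path]
    )
    return new, changed, deleted
```

Only the paths in `new` and `changed` need their text extracted again, which is why later syncs are fast when little has changed.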

The database stores the text content only — original PDFs are accessed on disk for read_page and read_page_image. Rendering uses CoreGraphics on macOS and PyMuPDF elsewhere.
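
The DPI auto-scaling for region crops can be worked through with a little arithmetic. The sketch below assumes page dimensions are given in PDF points (72 per inch) and that the DPI is chosen so the crop's longer edge lands exactly on the 1568 px budget; the real renderer may clamp or round differently, and `crop_dpi` is a hypothetical helper.

```python
POINTS_PER_INCH = 72.0
LONG_EDGE_BUDGET_PX = 1568  # budget stated in the README


def crop_dpi(
    page_w_pt: float, page_h_pt: float, region: list[float]
) -> tuple[float, tuple[int, int]]:
    """Return (dpi, (width_px, height_px)) for a fractional crop region."""
    x1, y1, x2, y2 = region
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError("region must be [x1, y1, x2, y2] with 0.0 <= start < end <= 1.0")
    crop_w_pt = (x2 - x1) * page_w_pt
    crop_h_pt = (y2 - y1) * page_h_pt
    # Scale so the longer crop edge fills the pixel budget.
    dpi = LONG_EDGE_BUDGET_PX * POINTS_PER_INCH / max(crop_w_pt, crop_h_pt)
    width_px = round(crop_w_pt * dpi / POINTS_PER_INCH)
    height_px = round(crop_h_pt * dpi / POINTS_PER_INCH)
    return dpi, (width_px, height_px)


if __name__ == "__main__":
    # A4 page (595 x 842 pt), cropping a full-width band from 50% to 80% of the height.
    print(crop_dpi(595, 842, [0.0, 0.5, 1.0, 0.8]))
```

Under these assumptions a smaller region yields a higher DPI, so a tight crop around a formula is rendered in more detail than the same area would be on a full-page render.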

License

MIT
