PDF Context Server
PDF Context (pdf-context on PyPI, import pdf_context) is a local-first library and MCP server that transforms PDF documents into structured, retrievable context for AI applications.
Drop PDFs into a watch folder, and the server ingests them automatically — extracting structure, classifying document type, chunking with awareness of chapters/sections, embedding locally, and exposing retrieval tools that AI clients use to teach, answer questions, or navigate documents sequentially.
Drop in PDFs. Build context once. Query from anywhere.
Install
From PyPI (when published):
pip install pdf-context
From source (development):
git clone https://github.com/yourusername/pdf-context-server.git
cd pdf-context-server
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
Console entry points:
- pdf-context: developer CLI (ingest, search, smoke tests)
- pdf-context-mcp: MCP stdio server for AI clients
Programmatic API
from pdf_context import PdfContext, PdfContextConfig

config = PdfContextConfig(
    pdf_data_dir="/path/to/pdfs",
    storage_dir="/path/to/storage",
)
ctx = PdfContext(config, watch=False)
ctx.ingest("my-book.pdf")
results = ctx.search("virtual memory", document="my-book.pdf", top_k=5)
print(results["chunks"])
Each PdfContext instance is isolated: separate (pdf_dir, storage_dir) pairs get separate SQLite + Chroma indexes.
Features (v1)
- PDF ingestion with folder watch and background job queue
- Structure extraction from PDF outlines, heading heuristics, or page-level fallback
- Auto document classification: textbook, technical_reference, paper, notes
- Per-type retrieval profiles (chunk size, sequential vs semantic-first)
- Local embeddings (sentence-transformers default, Ollama optional)
- ChromaDB vector storage + SQLite metadata
- Semantic search with structure filters (chapter, section, page range)
- Sequential navigation for chapter-by-chapter learning
- MCP stdio server for any compatible AI client
- Production-grade local reliability: dedup, retries, checkpoints, resume
Architecture
PDF Documents (data/pdfs/)
│
▼
Folder Watcher ──► Job Queue (SQLite)
│
▼
Structure Extract + Classify + Parse + Chunk + Embed
│
├──► ChromaDB (vectors + metadata)
└──► SQLite (documents, structure, chunks, jobs)
│
▼
MCP Server (stdio)
├── NavigationalEngine (sequential / section content)
└── SemanticEngine (scoped semantic search)
│
▼
AI Client (Cursor, Claude Desktop, etc.)
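The watcher-to-queue step in the diagram can be sketched with the standard library alone. The table name, columns, retry limit, and helper functions below are hypothetical and are not pdf-context's actual schema; the sketch only illustrates how a SQLite-backed queue might deduplicate by content hash and retry failed jobs.

```python
import hashlib
import sqlite3

MAX_ATTEMPTS = 3  # assumed retry limit, not taken from pdf-context


def open_queue(path=":memory:"):
    """Create a hypothetical jobs table; the real schema may differ."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY,
            pdf_path TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            attempts INTEGER NOT NULL DEFAULT 0,
            UNIQUE (pdf_path, content_hash)
        )
        """
    )
    return conn


def enqueue(conn, pdf_path, content):
    """Queue a PDF once per (path, content hash); return True if newly queued."""
    digest = hashlib.sha256(content).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (pdf_path, content_hash) VALUES (?, ?)",
        (pdf_path, digest),
    )
    conn.commit()
    return cur.rowcount == 1


def claim_next(conn):
    """Mark the oldest pending job as running; return (id, pdf_path) or None."""
    row = conn.execute(
        "SELECT id, pdf_path FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE jobs SET status = 'running', attempts = attempts + 1 WHERE id = ?",
        (row[0],),
    )
    conn.commit()
    return row


def finish(conn, job_id, ok):
    """Mark a job done, or requeue it until MAX_ATTEMPTS is reached."""
    if ok:
        status = "done"
    else:
        attempts = conn.execute(
            "SELECT attempts FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()[0]
        status = "pending" if attempts < MAX_ATTEMPTS else "failed"
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    conn.commit()
```

Because the uniqueness constraint covers both the path and the hash, an unchanged file is queued once, while a changed file at the same path gets a new job.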
Project Structure
pdf-context-server/
├── pdf_context/ # installable package
│ ├── client.py # PdfContext public API
│ ├── config.py # PdfContextConfig
│ ├── context.py # AppContext runtime
│ ├── cli.py # pdf-context CLI
│ ├── mcp/ # MCP factory + stdio entry
│ ├── classification/
│ ├── structure/
│ ├── parsers/
│ ├── chunking.py
│ ├── embeddings.py
│ ├── vector_store.py
│ ├── db/
│ ├── ingest/
│ ├── retrieval/
│ └── skills/ # bundled agent skills (CLI install)
├── app/ # deprecated shim (python -m app.main)
├── .cursor/
│ ├── mcp.json # project MCP config (example)
│ └── skills/pdf-context/
├── data/pdfs/
├── storage/
├── tests/
├── pyproject.toml
├── requirements.txt # dev convenience (see pyproject.toml)
├── .env.example
└── README.md
Installation (legacy dev clone)
See Install above. requirements.txt mirrors runtime deps; prefer pip install -e ".[dev]".
Quick test (no MCP required)
Drop a PDF in data/pdfs/, then run one command:
pdf-context smoke
That ingests all PDFs, runs a sample search, and prints PASS or FAIL with details.
Other useful commands:
pdf-context status
pdf-context list
pdf-context ingest
pdf-context ingest "my-book.pdf"
pdf-context search "virtual memory" -d "my-book.pdf"
pdf-context --pdf-dir /path/pdfs --storage-dir /path/storage status
pdf-context skill list
pdf-context skill install
pytest
Or with Make: make smoke, make status, make test.
MCP is for daily use in Cursor. The CLI is for verifying everything works without configuring or reloading MCP.
Adding Documents
Place PDFs in your configured PDF folder (default data/pdfs/):
data/pdfs/
├── operating-systems.pdf
├── api-reference.pdf
└── lecture-notes.pdf
Keep PDF and storage folders separate. pdf_data_dir and storage_dir must not be the same path, and neither may live inside the other. Mixing them causes the folder watcher to pick up Chroma/SQLite files, or ingest metadata into your PDF tree. Use sibling directories (defaults data/pdfs/ + storage/ are fine).
The folder watcher auto-enqueues new or changed PDFs for ingestion.
Optional type override sidecar:
data/pdfs/operating-systems.pdf.meta.json
{ "doc_type": "textbook" }
Valid types: textbook, technical_reference, paper, notes
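A sidecar can also be written from a script. The helper below is a minimal sketch, not part of pdf-context: the function name is hypothetical, and it only assumes the sidecar naming (PDF file name plus .meta.json) and the doc_type key shown above.

```python
import json
from pathlib import Path

VALID_DOC_TYPES = {"textbook", "technical_reference", "paper", "notes"}


def write_type_override(pdf_path, doc_type):
    """Write <pdf name>.meta.json next to the PDF and return the sidecar path."""
    if doc_type not in VALID_DOC_TYPES:
        raise ValueError(
            f"unknown doc_type {doc_type!r}; expected one of {sorted(VALID_DOC_TYPES)}"
        )
    pdf_path = Path(pdf_path)
    # operating-systems.pdf -> operating-systems.pdf.meta.json
    sidecar = pdf_path.with_name(pdf_path.name + ".meta.json")
    sidecar.write_text(json.dumps({"doc_type": doc_type}) + "\n", encoding="utf-8")
    return sidecar
```

Rejecting unknown types before writing keeps a typo from reaching the ingest pipeline, whose handling of invalid values is not documented here.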
MCP Setup
Enable pdf-context only in projects where PDFs are your source of truth. Avoid enabling it globally in Cursor user settings if most chats are code or general work—when the server is disconnected, the model cannot call PDF tools at all.
Add to project .cursor/mcp.json (Cursor) or your client's MCP config:
{
"mcpServers": {
"pdf-context": {
"command": "pdf-context-mcp",
"args": [
"--pdf-dir", "/absolute/path/to/pdfs",
"--storage-dir", "/absolute/path/to/storage"
]
}
}
}
No repo clone required after pip install pdf-context. For local dev, point command at .venv/bin/pdf-context-mcp.
Legacy (deprecated): "command": "python", "args": ["-m", "app.main"]
Use a descriptive server name (pdf-context, pdf-ml-book, pdf-papers) so rules and skills can refer to the right corpus.
Restart or reload MCP after changing config.
Multiple corpora (textbooks vs papers)
Run one MCP process per (pdf folder, storage) pair. Example:
{
"mcpServers": {
"pdf-textbooks": {
"command": "pdf-context-mcp",
"args": [
"--pdf-dir", "/Users/me/books",
"--storage-dir", "/Users/me/.pdf-context/books",
"--instance-id", "textbooks"
]
},
"pdf-papers": {
"command": "pdf-context-mcp",
"args": [
"--pdf-dir", "/Users/me/papers",
"--storage-dir", "/Users/me/.pdf-context/papers",
"--instance-id", "papers"
]
}
}
}
Or via environment (PDF_CONTEXT_PDF_DATA_DIR, PDF_CONTEXT_STORAGE_DIR; legacy PDF_DATA_DIR / STORAGE_DIR still work):
"env": {
"PDF_CONTEXT_PDF_DATA_DIR": "/Users/me/books",
"PDF_CONTEXT_STORAGE_DIR": "/Users/me/.pdf-context/books"
}
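With more than a couple of corpora, the config can be generated rather than hand-edited. This sketch uses only the flags shown above (--pdf-dir, --storage-dir, --instance-id); the function name is hypothetical, and naming each server pdf-<instance id> simply follows the example, not a requirement.

```python
import json


def build_mcp_config(corpora):
    """Return an mcpServers mapping with one pdf-context-mcp entry per corpus.

    corpora maps an instance id to a (pdf_dir, storage_dir) pair.
    """
    servers = {}
    for instance_id, (pdf_dir, storage_dir) in corpora.items():
        servers[f"pdf-{instance_id}"] = {
            "command": "pdf-context-mcp",
            "args": [
                "--pdf-dir", pdf_dir,
                "--storage-dir", storage_dir,
                "--instance-id", instance_id,
            ],
        }
    return {"mcpServers": servers}


config = build_mcp_config(
    {
        "textbooks": ("/Users/me/books", "/Users/me/.pdf-context/books"),
        "papers": ("/Users/me/papers", "/Users/me/.pdf-context/papers"),
    }
)
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into .cursor/mcp.json or your client's MCP config.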
When the AI client should call MCP tools
The model chooses tools from your message, tool descriptions, and project skills—it is not automatic. This project steers that behavior in three layers:
- Tool docstrings in pdf_context/mcp/server.py: each tool states use when / do not use when.
- Project skill: .cursor/skills/pdf-context/SKILL.md tells Cursor when to use pdf-context vs codebase tools.
- Project-scoped MCP: enable the server only where PDFs matter.
| User intent | Expected tools |
|---|---|
| Fix code / git / tests | None (pdf-context idle) |
| "What does the book say about X?" | search_pdf_context (+ maybe list_documents) |
| Chapter walkthrough with cites | list_chapters, get_section_content, get_next_chunks |
| "Is my PDF indexed?" | get_ingest_status, list_documents |
| Casual chat | None |
Phrases that help: "From the indexed PDFs…", "Search [filename] for…", "Don't guess—use pdf-context."
Phrases that skip PDF tools: "In general (no PDF)", "Fix this Python file."
After pulling this repo, reload MCP so clients pick up new tool descriptions.
Install agent skill for any AI client
Bundled skills live in pdf_context/skills/. Install into Cursor, Claude Code, VS Code Copilot, Codex/AGENTS.md, Windsurf, Gemini, or a custom path:
pdf-context skill install
pdf-context skill list
pdf-context skill install -s pdf-context -c claude-code -p .
| --client | Writes to |
|---|---|
| cursor-project | .cursor/skills/pdf-context/SKILL.md |
| cursor-global | ~/.cursor/skills/pdf-context/SKILL.md |
| claude-code | CLAUDE.md |
| vscode-copilot | .github/copilot-instructions.md |
| codex-agents | AGENTS.md |
| windsurf | .windsurfrules |
| gemini | GEMINI.md |
| custom | path from --output |
Markdown targets include marked blocks (<!-- pdf-context-skill:start/end -->) so re-running install can update the section without wiping your file.
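The marked-block update can be illustrated with a short sketch. It assumes the markers are spelled exactly `<!-- pdf-context-skill:start -->` and `<!-- pdf-context-skill:end -->`; the function is hypothetical and is not the installer's actual code.

```python
# Assumed marker spelling; check an installed file for the exact form.
START = "<!-- pdf-context-skill:start -->"
END = "<!-- pdf-context-skill:end -->"


def upsert_marked_block(existing, skill_text):
    """Replace the marked block in existing text, or append one if absent."""
    block = f"{START}\n{skill_text.strip()}\n{END}"
    start = existing.find(START)
    end = existing.find(END, start + len(START)) if start != -1 else -1
    if start != -1 and end != -1:
        # Keep everything outside the markers untouched.
        return existing[:start] + block + existing[end + len(END):]
    if not existing.strip():
        return block + "\n"
    return existing.rstrip("\n") + "\n\n" + block + "\n"
```

Running it twice on the same file leaves a single block holding the latest skill text, with the rest of the file preserved.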
MCP Tools
| Tool | Purpose |
|---|---|
| list_documents | Corpus check: what's indexed; call if unsure of scope |
| get_ingest_status | Queue health; new PDFs; empty search debugging |
| get_document_profile | Doc type, retrieval profile, per-document guidance |
| list_structure | Full TOC tree |
| list_chapters | Flat chapter list (textbooks) |
| get_section_content | Ordered chunks for a chapter/section |
| get_next_chunks | Sequential read-ahead from cursor |
| search_pdf_context | Semantic search with optional structure filters |
| set_document_type | Override auto-classification (when user asks) |
| reingest_document | Force re-index (when user asks) |
Each tool's MCP description includes when to call it and when to skip it.
Document Types
| Type | Treatment |
|---|---|
| textbook | Sequential chapter navigation; larger chunks; chapter-scoped search |
| technical_reference | Semantic-first; section-scoped search; no forced sequential reading |
| paper | Section-scoped semantic search (abstract, methods, etc.) |
| notes | Weak structure; semantic-only; page-level fallback navigation |
Classification is automatic at ingest. Override via .meta.json or set_document_type.
Chapter-by-Chapter Learning Workflow
The AI client holds progress via the cursor returned by navigational tools.
1. get_document_profile("operating-systems.pdf")
2. list_chapters("operating-systems.pdf")
3. get_section_content("operating-systems.pdf", node_id=<chapter_id>, limit=5)
4. [Client teaches / summarizes from returned chunks]
5. search_pdf_context("page faults", document="operating-systems.pdf", chapter_id=<id>)
6. get_next_chunks("operating-systems.pdf", cursor=<last_cursor>, limit=5)
For unstructured notes, use list_structure and semantic search without sequential navigation.
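The cursor hand-off in steps 3 and 6 can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: the FakeNavigator class, the integer cursor, and the {"chunks", "cursor"} response shape are not taken from the real tools, whose payloads may differ.

```python
class FakeNavigator:
    """In-memory stand-in for the navigational tools; not the real server."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def get_section_content(self, limit=5):
        # The first read starts at the beginning of the section.
        return self.get_next_chunks(cursor=0, limit=limit)

    def get_next_chunks(self, cursor, limit=5):
        batch = self._chunks[cursor:cursor + limit]
        next_cursor = cursor + len(batch)
        # In this sketch a None cursor signals that nothing is left to read.
        return {
            "chunks": batch,
            "cursor": next_cursor if next_cursor < len(self._chunks) else None,
        }


def read_all(nav, limit=5):
    """Walk the content in order, carrying the cursor between calls."""
    page = nav.get_section_content(limit=limit)
    seen = list(page["chunks"])
    while page["cursor"] is not None:
        page = nav.get_next_chunks(cursor=page["cursor"], limit=limit)
        seen.extend(page["chunks"])
    return seen
```

The point of the pattern is that the server side stays stateless: the client stores the last cursor and passes it back to continue reading.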
Configuration
See .env.example. Key settings:
| Variable | Default | Description |
|---|---|---|
| PDF_CONTEXT_PDF_DATA_DIR | data/pdfs | PDF watch folder (must not overlap storage) |
| PDF_CONTEXT_STORAGE_DIR | storage | SQLite + Chroma (must not overlap PDF folder) |
| PDF_CONTEXT_EMBEDDING_PROVIDER | sentence_transformers | or ollama |
| PDF_CONTEXT_EMBEDDING_MODEL | all-MiniLM-L6-v2 | Local embedding model |
| PDF_CONTEXT_WATCH_ENABLED | true | Auto-ingest on folder changes |
| PDF_CONTEXT_CHECKPOINT_PAGE_INTERVAL | 50 | Resume checkpoint during large ingests |
Legacy PDF_DATA_DIR / STORAGE_DIR (no prefix) are accepted for one release.
Path layout rule: After resolving to absolute paths, pdf_data_dir and storage_dir must differ and must not be nested (parent/child). Configuration is validated at startup when directories are created; invalid layouts raise a clear error.
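To check a layout before starting the server, the rule can be approximated as below. This is a sketch of the rule as stated, not the package's own validator; the function name and error messages are hypothetical.

```python
from pathlib import Path


def validate_layout(pdf_data_dir, storage_dir):
    """Raise ValueError if the two directories are equal or nested."""
    # Resolve to absolute paths first, as the rule requires.
    pdf = Path(pdf_data_dir).expanduser().resolve()
    storage = Path(storage_dir).expanduser().resolve()
    if pdf == storage:
        raise ValueError(f"pdf_data_dir and storage_dir are the same path: {pdf}")
    if pdf in storage.parents:
        raise ValueError(f"storage_dir {storage} is inside pdf_data_dir {pdf}")
    if storage in pdf.parents:
        raise ValueError(f"pdf_data_dir {pdf} is inside storage_dir {storage}")
    return pdf, storage
```

For example, validate_layout("data/pdfs", "storage") passes, while validate_layout("data", "data/storage") raises because the storage folder sits inside the PDF folder.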
First ingest of a large library (20+ textbooks, ~20k pages) on CPU may take hours. Checkpoints make ingestion resumable if interrupted.
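How page-interval checkpoints make a long ingest resumable can be shown with a small sketch. The JSON checkpoint file, its next_page key, and the ingest_pages function are assumptions for illustration; the real pipeline may record checkpoints differently (for example in SQLite).

```python
import json
from pathlib import Path

CHECKPOINT_PAGE_INTERVAL = 50  # mirrors the documented default


def ingest_pages(total_pages, checkpoint_file, process_page,
                 interval=CHECKPOINT_PAGE_INTERVAL):
    """Process pages in order, saving the next page index every `interval` pages."""
    checkpoint_file = Path(checkpoint_file)
    start = 0
    if checkpoint_file.exists():
        # Resume from the last saved position instead of page 0.
        start = json.loads(checkpoint_file.read_text(encoding="utf-8"))["next_page"]
    for page in range(start, total_pages):
        process_page(page)
        if (page + 1) % interval == 0:
            checkpoint_file.write_text(
                json.dumps({"next_page": page + 1}), encoding="utf-8"
            )
    # Finished: the checkpoint is no longer needed.
    checkpoint_file.unlink(missing_ok=True)
```

If a run is interrupted, pages processed after the last checkpoint are processed again on the next run, so per-page work in a design like this should be safe to repeat.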
Technology Stack
- Python 3.11+
- PyMuPDF — PDF parsing and outline extraction
- sentence-transformers — local embeddings
- ChromaDB — vector storage
- SQLite — metadata, structure, job queue
- MCP — AI client integration
- watchdog — folder watching
Development
pip install -e ".[dev]"
pytest
pdf-context --help
pdf-context-mcp --help
Vision
PDF Context Server converts static PDFs into structured, searchable knowledge that AI applications consume on demand — without re-uploading documents every session.
Retrieval, not synthesis: the server returns ranked chunks and structure metadata; your AI client generates answers, lessons, and summaries.