doc-index
A local-first semantic search server for documents, supporting PDFs, Office files, and text/markdown, enabling natural language search via the Model Context Protocol (MCP).
README
Doc Index MCP
What is This For?
A local-first semantic search server for your documents. Index PDFs, Word docs, PowerPoints, Excel files, and text/markdown, then search them using natural language via the Model Context Protocol (MCP).
- Semantic search - Find relevant content using natural language queries
- Boundary-aware chunking - Respects document structure (chapters, sections, headers)
- Table extraction - Extract tables from documents as CSV
- Fully local - No external APIs, no cloud services, no Docker containers, no PyTorch
- Lightweight - ONNX-based embeddings (~50MB vs ~2GB for PyTorch)
Quick Start
1. Add to your MCP config
Requires uv. If you don't have uv, see Alternative Installation below.
Add to .mcp.json in your project root (for Claude Code) or your Claude Desktop config:
{
"mcpServers": {
"doc-index": {
"command": "uvx",
"args": ["doc-index-mcp"]
}
}
}
2. Install the Claude skill (optional)
The skill teaches Claude how to use the search tools effectively (token budgets, boundary expansion, etc.):
uvx --from doc-index-mcp doc-index-install-skill
That's it — start asking Claude to index and search your documents.
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| Text | .txt |
Plain text |
| Markdown | .md, .markdown |
Preserves headers for boundaries |
.pdf |
Text extraction with page markers | |
| Word | .docx |
Paragraphs, headings, tables |
| PowerPoint | .pptx |
Slides, notes, tables |
| Excel | .xlsx, .xls |
Sheets as tables |
Why No External Services?
| Component | Traditional RAG | This Server |
|---|---|---|
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local .docindex/ directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |
Tools
doc_index
Index a document for semantic search.
{
"file_path": "docs/manual.pdf",
"source_name": "manual"
}
doc_search
Search indexed documents using natural language.
{
"query": "how to configure authentication",
"top_k": 5,
"expand_to_boundary": "section",
"max_return_tokens": 4096
}
Parameters:
query- Search querysources- Filter to specific sources (optional)top_k- Number of results (default: 5)expand_to_boundary- Expand results to full "chapter", "section", "subsection", or "page"max_return_tokens- Token budget for results (default: 4096)include_siblings- Include sibling sections when expanding
doc_list
List all indexed sources.
doc_chunk
Retrieve a specific chunk by ID with optional neighbors.
{
"chunk_id": "manual:42",
"neighbors": 2
}
doc_toc
Get the table of contents (chapters, sections, subsections) for an indexed document. Use this to understand document structure before retrieving specific content.
{
"source_name": "manual",
"max_depth": 3
}
doc_get_content
Retrieve document content by structural location. Provide exactly one locator: boundary_id, chapter, section, or pages.
{
"source_name": "manual",
"chapter": "3",
"max_return_tokens": 8192
}
read_document
Read a document without indexing. Returns formatted text.
{
"file_path": "report.pdf",
"max_chars": 100000
}
list_tables
List all tables in a document.
{
"file_path": "data.xlsx"
}
extract_table
Extract a specific table as CSV.
{
"file_path": "data.xlsx",
"table_index": 0,
"max_rows": 100
}
Environment Variables
| Variable | Description | Default |
|---|---|---|
MCP_WORKING_DIR |
Base directory for resolving file paths | Current working directory |
DOC_INDEX_DIR |
Directory for storing vector indices | .docindex in working dir |
Alternative Installation
Install globally with pip
pip install doc-index-mcp
Then in your .mcp.json:
{
"mcpServers": {
"doc-index": {
"command": "doc-index-mcp"
}
}
}
Install from source
Clone the repo and install dependencies:
git clone https://github.com/mike-anderson/doc-index-mcp.git
cd doc-index-mcp
pip install -e .
Then point your .mcp.json at the server entrypoint:
{
"mcpServers": {
"doc-index": {
"command": "python",
"args": ["/path/to/doc-index-mcp/src/server.py"]
}
}
}
Architecture
Everything runs locally - no external APIs, databases, or embedding servers required.
flowchart TB
subgraph Client["MCP Client (Claude Desktop, etc.)"]
LLM[LLM]
end
subgraph MCP["Doc Index MCP Server"]
Server[server.py]
subgraph Services["Local Services"]
Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
Chunker[Boundary-Aware<br/>Chunker]
Embedder[Embedder<br/>ONNX Runtime]
VectorStore[Vector Store<br/>usearch]
end
end
subgraph Storage["Local Filesystem"]
Docs[(Source<br/>Documents)]
Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/> ├── index.usearch<br/> ├── chunks.jsonl<br/> └── boundaries.json")]
end
subgraph Models["Embedded Model (downloaded once)"]
ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
end
LLM <-->|MCP Protocol| Server
Server --> Loader
Server --> Chunker
Server --> Embedder
Server --> VectorStore
Loader -->|read| Docs
VectorStore <-->|read/write| Index
Embedder -->|load once| ONNX
style Client fill:#e1f5fe
style Storage fill:#fff3e0
style Models fill:#f3e5f5
style MCP fill:#e8f5e9
Data Flow
flowchart LR
subgraph Index["Indexing"]
direction TB
A[Document] --> B[Load & Extract Text]
B --> C[Detect Boundaries]
C --> D[Chunk ~256 tokens]
D --> E[Generate Embeddings]
E --> F[Save to Disk]
end
subgraph Search["Searching"]
direction TB
G[Query] --> H[Embed Query]
H --> I[Vector Similarity Search]
I --> J[Expand to Boundaries]
J --> K[Return Results]
end
Index -.->|stored in .docindex/| Search
License
MIT
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.