doc-index

README

Doc Index MCP

What is This For?

A local-first semantic search server for your documents. Index PDFs, Word docs, PowerPoints, Excel files, and text/markdown, then search them using natural language via the Model Context Protocol (MCP).

  • Semantic search - Find relevant content using natural language queries
  • Boundary-aware chunking - Respects document structure (chapters, sections, headers)
  • Table extraction - Extract tables from documents as CSV
  • Fully local - No external APIs, no cloud services, no Docker containers, no PyTorch
  • Lightweight - ONNX-based embeddings (~50MB vs ~2GB for PyTorch)
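
To make "boundary-aware chunking" concrete, here is a minimal sketch for markdown input. It is not the server's actual chunker: the function name `chunk_markdown` is hypothetical, whitespace-separated words stand in for real tokens, and only `#`-style headers are treated as boundaries. The point it illustrates is that a chunk is closed whenever a header is reached, so no chunk spans two sections.

```python
import re


def chunk_markdown(text, max_tokens=256):
    """Split markdown into chunks that never cross a header boundary.

    Assumption: "tokens" are approximated by whitespace-separated words.
    """
    chunks = []
    heading = None  # heading that the current chunk belongs to
    words = []      # words collected for the current chunk

    def flush():
        # Emit the current chunk (if any) and start a fresh one.
        if words:
            chunks.append({"heading": heading, "text": " ".join(words)})
            words.clear()

    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the chunk before crossing the boundary
            heading = match.group(1).strip()
            continue
        for word in line.split():
            words.append(word)
            if len(words) >= max_tokens:
                flush()  # size limit reached inside a section
    flush()
    return chunks
```

Each returned chunk carries the heading it came from, which is the kind of metadata a search step could later use to expand a hit to its full section.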

Quick Start

1. Add to your MCP config

Requires uv. If you don't have uv, see Alternative Installation below.

Add to .mcp.json in your project root (for Claude Code) or your Claude Desktop config:

{
  "mcpServers": {
    "doc-index": {
      "command": "uvx",
      "args": ["doc-index-mcp"]
    }
  }
}
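
If you already have an `.mcp.json` with other servers in it, the entry above needs to be merged rather than pasted over the file. The following sketch shows one way to do that with the standard library; the helper name `add_doc_index_server` is an illustration, not part of this project.

```python
import json
from pathlib import Path


def add_doc_index_server(config_path):
    """Merge the doc-index entry into an MCP config file, keeping other servers."""
    config_path = Path(config_path)
    # Load the existing config if there is one; otherwise start empty.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    # Same entry as in the snippet above.
    servers["doc-index"] = {"command": "uvx", "args": ["doc-index-mcp"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    add_doc_index_server(".mcp.json")
```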

2. Install the Claude skill (optional)

The skill teaches Claude how to use the search tools effectively (token budgets, boundary expansion, etc.):

uvx --from doc-index-mcp doc-index-install-skill

That's it. Start asking Claude to index and search your documents.

Supported Formats

| Format | Extensions | Notes |
|---|---|---|
| Text | .txt | Plain text |
| Markdown | .md, .markdown | Preserves headers for boundaries |
| PDF | .pdf | Text extraction with page markers |
| Word | .docx | Paragraphs, headings, tables |
| PowerPoint | .pptx | Slides, notes, tables |
| Excel | .xlsx, .xls | Sheets as tables |

Why No External Services?

| Component | Traditional RAG | This Server |
|---|---|---|
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local .docindex/ directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |

Tools

doc_index

Index a document for semantic search.

{
  "file_path": "docs/manual.pdf",
  "source_name": "manual"
}

doc_search

Search indexed documents using natural language.

{
  "query": "how to configure authentication",
  "top_k": 5,
  "expand_to_boundary": "section",
  "max_return_tokens": 4096
}

Parameters:

  • query - Search query
  • sources - Filter to specific sources (optional)
  • top_k - Number of results (default: 5)
  • expand_to_boundary - Expand results to full "chapter", "section", "subsection", or "page"
  • max_return_tokens - Token budget for results (default: 4096)
  • include_siblings - Include sibling sections when expanding
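
When calling doc_search from your own client code, it can help to build and check the arguments in one place. The sketch below is a client-side helper, not part of the server. It assumes that `sources` is a list of source names and that `include_siblings` defaults to off; neither is stated above. The defaults for `top_k` and `max_return_tokens` are the documented ones.

```python
from typing import Iterable, Optional

# Boundary levels listed for expand_to_boundary above.
ALLOWED_BOUNDARIES = {"chapter", "section", "subsection", "page"}


def build_search_args(
    query: str,
    sources: Optional[Iterable[str]] = None,  # assumption: a list of source names
    top_k: int = 5,
    expand_to_boundary: Optional[str] = None,
    max_return_tokens: int = 4096,
    include_siblings: bool = False,           # assumption: off by default
) -> dict:
    """Build a doc_search arguments dict, rejecting obviously bad input."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if expand_to_boundary is not None and expand_to_boundary not in ALLOWED_BOUNDARIES:
        raise ValueError(f"expand_to_boundary must be one of {sorted(ALLOWED_BOUNDARIES)}")
    args = {"query": query, "top_k": top_k, "max_return_tokens": max_return_tokens}
    # Optional fields are only sent when they were actually set.
    if sources:
        args["sources"] = list(sources)
    if expand_to_boundary:
        args["expand_to_boundary"] = expand_to_boundary
    if include_siblings:
        args["include_siblings"] = True
    return args
```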

doc_list

List all indexed sources.

doc_chunk

Retrieve a specific chunk by ID with optional neighbors.

{
  "chunk_id": "manual:42",
  "neighbors": 2
}
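
The example ID `manual:42` suggests a `<source>:<index>` scheme, although the ID format is not specified here. Under that assumption, and assuming `neighbors` means "this many chunks on each side", the set of chunks returned could be computed roughly as follows. The function `neighbor_ids` and its `total` parameter are hypothetical.

```python
def neighbor_ids(chunk_id, neighbors, total):
    """List the IDs covered by a chunk plus `neighbors` chunks on each side.

    Assumptions: IDs look like "<source>:<index>" with 0-based indices,
    and `total` is the number of chunks in that source.
    """
    source, _, index = chunk_id.rpartition(":")
    center = int(index)
    first = max(0, center - neighbors)        # clamp at the start of the document
    last = min(total - 1, center + neighbors)  # clamp at the end of the document
    return [f"{source}:{i}" for i in range(first, last + 1)]
```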

doc_toc

Get the table of contents (chapters, sections, subsections) for an indexed document. Use this to understand document structure before retrieving specific content.

{
  "source_name": "manual",
  "max_depth": 3
}

doc_get_content

Retrieve document content by structural location. Provide exactly one locator: boundary_id, chapter, section, or pages.

{
  "source_name": "manual",
  "chapter": "3",
  "max_return_tokens": 8192
}
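
The "exactly one locator" rule is easy to get wrong when arguments are assembled programmatically. A small client-side check such as the following catches the mistake before the call is made; `check_locator` is an illustrative helper, not a function exposed by the server.

```python
# The four locator fields named in the tool description.
LOCATORS = ("boundary_id", "chapter", "section", "pages")


def check_locator(args):
    """Return the single locator present in `args`, or raise ValueError."""
    given = [name for name in LOCATORS if args.get(name) is not None]
    if len(given) != 1:
        raise ValueError(
            f"provide exactly one of {', '.join(LOCATORS)}; got {given or 'none'}"
        )
    return given[0]
```

For the request above, `check_locator` returns `"chapter"`; a request with both `chapter` and `pages`, or with neither, raises.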

read_document

Read a document without indexing. Returns formatted text.

{
  "file_path": "report.pdf",
  "max_chars": 100000
}

list_tables

List all tables in a document.

{
  "file_path": "data.xlsx"
}

extract_table

Extract a specific table as CSV.

{
  "file_path": "data.xlsx",
  "table_index": 0,
  "max_rows": 100
}
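
As a rough picture of what "extract as CSV" with a row limit involves, the sketch below serializes an in-memory table using Python's `csv` module. It is not the server's implementation. In particular, whether `max_rows` counts the header row is not documented; this sketch counts every row, header included.

```python
import csv
import io


def table_to_csv(rows, max_rows=100):
    """Serialize up to `max_rows` rows (header included) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows[:max_rows]:
        writer.writerow(row)  # csv handles quoting of commas and quotes
    return buffer.getvalue()
```

Using `csv.writer` rather than joining on commas matters as soon as a cell itself contains a comma or a quote character.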

Environment Variables

| Variable | Description | Default |
|---|---|---|
| MCP_WORKING_DIR | Base directory for resolving file paths | Current working directory |
| DOC_INDEX_DIR | Directory for storing vector indices | .docindex in working dir |
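
The defaults in this table can be read as the following resolution logic. This is a sketch of the documented behavior, not the server's code: `resolve_dirs` and `resolve_file` are hypothetical names, and treating an empty variable the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (working_dir, index_dir) following the documented defaults."""
    env = os.environ if env is None else env
    # MCP_WORKING_DIR falls back to the current working directory.
    working = Path(env.get("MCP_WORKING_DIR") or Path.cwd())
    # DOC_INDEX_DIR falls back to .docindex inside the working directory.
    index = Path(env.get("DOC_INDEX_DIR") or working / ".docindex")
    return working, index


def resolve_file(file_path, working):
    """Resolve a relative file_path against the working directory."""
    path = Path(file_path)
    return path if path.is_absolute() else working / path
```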

Alternative Installation

Install globally with pip

pip install doc-index-mcp

Then in your .mcp.json:

{
  "mcpServers": {
    "doc-index": {
      "command": "doc-index-mcp"
    }
  }
}

Install from source

Clone the repo and install dependencies:

git clone https://github.com/mike-anderson/doc-index-mcp.git
cd doc-index-mcp
pip install -e .

Then point your .mcp.json at the server entrypoint:

{
  "mcpServers": {
    "doc-index": {
      "command": "python",
      "args": ["/path/to/doc-index-mcp/src/server.py"]
    }
  }
}

Architecture

Everything runs locally - no external APIs, databases, or embedding servers required.

flowchart TB
    subgraph Client["MCP Client (Claude Desktop, etc.)"]
        LLM[LLM]
    end

    subgraph MCP["Doc Index MCP Server"]
        Server[server.py]

        subgraph Services["Local Services"]
            Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
            Chunker[Boundary-Aware<br/>Chunker]
            Embedder[Embedder<br/>ONNX Runtime]
            VectorStore[Vector Store<br/>usearch]
        end
    end

    subgraph Storage["Local Filesystem"]
        Docs[(Source<br/>Documents)]
        Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/>    ├── index.usearch<br/>    ├── chunks.jsonl<br/>    └── boundaries.json")]
    end

    subgraph Models["Embedding Model (downloaded once)"]
        ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
    end

    LLM <-->|MCP Protocol| Server
    Server --> Loader
    Server --> Chunker
    Server --> Embedder
    Server --> VectorStore

    Loader -->|read| Docs
    VectorStore <-->|read/write| Index
    Embedder -->|load once| ONNX

    style Client fill:#e1f5fe
    style Storage fill:#fff3e0
    style Models fill:#f3e5f5
    style MCP fill:#e8f5e9

Data Flow

flowchart LR
    subgraph Index["Indexing"]
        direction TB
        A[Document] --> B[Load & Extract Text]
        B --> C[Detect Boundaries]
        C --> D[Chunk ~256 tokens]
        D --> E[Generate Embeddings]
        E --> F[Save to Disk]
    end

    subgraph Search["Searching"]
        direction TB
        G[Query] --> H[Embed Query]
        H --> I[Vector Similarity Search]
        I --> J[Expand to Boundaries]
        J --> K[Return Results]
    end

    Index -.->|stored in .docindex/| Search
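
The search half of this flow (embed the query, rank stored chunks by vector similarity, return the top results) can be illustrated with a toy, dependency-free sketch. Everything in it is a stand-in: a bag-of-words counter replaces the ONNX embedding model, a brute-force scan replaces the usearch index, and boundary expansion is left out. Only the overall shape matches the diagram.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for the real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, chunks, top_k=5):
    """Rank chunks by similarity to the query and return the best `top_k`."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A real embedding model places paraphrases close together even when they share no words, which this word-overlap toy cannot do; that difference is what makes the search semantic.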

License

MIT
