PageIndex MCP

A self-hosted MCP server for PageIndex's vectorless, reasoning-based document retrieval. It ingests PDFs into a hierarchical table of contents using an LLM and serves documents, structure, and page content via MCP tools.

PageIndex MCP (self-hosted)

A self-hosted MCP server exposing PageIndex's vectorless, reasoning-based document retrieval. The pageindex/ directory is a vendored copy of VectifyAI's open-source PageIndex package (MIT licensed, see pageindex/LICENSE.upstream).

How it works:

  • Ingest (app/ingest.py): builds a hierarchical "table of contents" tree for a PDF using an LLM (Gemini Flash by default, configurable via PAGEINDEX_MODEL / pageindex/config.yaml + LiteLLM). The LLM cost is small and is paid once per document.
  • Hybrid OCR (app/ocr.py): pages whose embedded text layer is too sparse (scans, image-heavy slides) are rendered and transcribed by a vision model during ingest; born-digital text pages are read losslessly for free. The transcriptions are cached so retrieval serves them too. Disable with PAGEINDEX_OCR_MODEL=off.
  • Serve (app/server.py): exposes list_documents, get_document, get_document_structure, get_page_content as MCP tools over streamable HTTP, protected by a bearer token. The connecting agent (e.g. Claude) does the navigation/reasoning itself - serving is free after ingest.
  • Text files: anything that isn't a PDF (code, Jupyter notebooks, markdown, any UTF-8 file up to 10 MB) is stored as plain text without LLM ingest - instantly available, zero cost. The same MCP tools serve them, with 1-indexed line numbers taking the role of page numbers (e.g. get_page_content(doc_id, "1-200") returns the first 200 lines). Notebook outputs are stripped on upload; only markdown and code cells are kept.
  • Web UI (/): minimal document manager - create folders (projects), upload PDFs and text files (PDF ingest runs in background workers), watch queued/processing status, rename and delete documents. Clicking a document opens a detail view (description, dates) with a button to open the original PDF/file in a new tab. Unlock with the same bearer token; it is kept in the browser's localStorage.
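
The line-number addressing for text files can be illustrated with a minimal sketch. The helpers parse_range and get_lines below are hypothetical names, not the server's actual code; the sketch only assumes that a spec such as "1-200" is 1-indexed and inclusive, as described above, and that a bare number selects a single line.

```python
def parse_range(spec: str) -> tuple[int, int]:
    """Parse "N" or "N-M" (1-indexed, inclusive) into (start, end)."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid range: {spec!r}")
    return start, end


def get_lines(text: str, spec: str) -> str:
    """Return the requested lines, treating line numbers like page numbers."""
    start, end = parse_range(spec)
    lines = text.splitlines()
    # Slicing clamps silently when `end` runs past the last line.
    return "\n".join(lines[start - 1:end])


# Illustrative input: a 500-line text file.
sample = "\n".join(f"line {n}" for n in range(1, 501))
first_200 = get_lines(sample, "1-200")
print(len(first_200.splitlines()))  # 200
```

How the real server treats out-of-range or malformed specs is not stated here; the clamping and the ValueError are choices made for this sketch.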

Setup

  1. Copy .env.example to .env and fill in:

    • GEMINI_API_KEY - used only during ingest (tree building + OCR)
    • PAGEINDEX_MCP_API_KEY - bearer token clients must send, e.g. openssl rand -hex 32
    • optional:
      • PAGEINDEX_MODEL - any LiteLLM model id for tree building (default gemini/gemini-3.5-flash)
      • PAGEINDEX_OCR_MODEL - vision model for text-poor pages ("off" to disable)
      • PAGEINDEX_INGEST_WORKERS - number of parallel PDF ingests (default 2)
      • PAGEINDEX_MAX_CONCURRENT_LLM - global cap on simultaneous LLM calls (default 8), which protects against provider rate limits. If an ingest still hits a rate limit or the model is overloaded, it is re-queued automatically with a growing cooldown.
  2. Build and start:

    docker compose up -d --build
    
  3. Upload PDFs and text files via the web UI at https://<your-domain>/ (unlock with the PAGEINDEX_MCP_API_KEY). PDF ingest runs in the background; the list shows processing/done/failed per document. Text files are done immediately.

    Alternatively via CLI inside the container:

    docker compose exec pageindex-mcp python3 app/ingest.py /data/pdfs/lecture01.pdf --project "Machine Learning"
    

    Trees are saved to <data>/trees/<doc_id>.json and registered in <data>/documents.json.
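
The interplay of PAGEINDEX_MAX_CONCURRENT_LLM and the growing cooldown from step 1 can be sketched with an asyncio semaphore. This is an illustration under assumptions, not the project's implementation: RateLimited, call_with_cap and flaky_call are hypothetical names, and the sketch retries a single call in place, whereas the server is described as re-queueing the whole ingest.

```python
import asyncio
import os


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit or overload error."""


# Same default as documented for PAGEINDEX_MAX_CONCURRENT_LLM.
MAX_CONCURRENT = int(os.environ.get("PAGEINDEX_MAX_CONCURRENT_LLM", "8"))


async def call_with_cap(semaphore, make_call, retries=3, base_delay=0.01):
    """Run make_call() under the shared cap, waiting longer after each failure."""
    for attempt in range(retries + 1):
        try:
            async with semaphore:  # at most MAX_CONCURRENT calls in flight
                return await make_call()
        except RateLimited:
            if attempt == retries:
                raise
            # Growing cooldown: the delay doubles after every failed attempt.
            await asyncio.sleep(base_delay * 2 ** attempt)


async def demo():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    attempts = {"n": 0}

    async def flaky_call():
        # Simulated LLM call that is rate limited twice, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise RateLimited()
        return "tree built"

    return await call_with_cap(semaphore, flaky_call), attempts["n"]


result, tries = asyncio.run(demo())
print(result, tries)
```

The semaphore is released before the sleep, so a call that is cooling down does not occupy one of the concurrency slots.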

Connecting an MCP client

{
  "mcpServers": {
    "pageindex-self": {
      "type": "http",
      "url": "https://<your-domain>/mcp",
      "headers": {
        "Authorization": "Bearer <PAGEINDEX_MCP_API_KEY>"
      }
    }
  }
}

For Claude Code:

claude mcp add --transport http pageindex-self https://<your-domain>/mcp \
  --header "Authorization: Bearer <PAGEINDEX_MCP_API_KEY>"

For opencode (~/.config/opencode/opencode.json, or a project-level opencode.json):

{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "pageindex-self": {
      "type": "remote",
      "url": "https://<your-domain>/mcp",
      "enabled": true,
      "headers": {
        "Authorization": "Bearer <PAGEINDEX_MCP_API_KEY>"
      }
    }
  }
}

To keep the token out of the config file, opencode supports env substitution: "Authorization": "Bearer {env:PAGEINDEX_MCP_API_KEY}".

Deployment

The compose file attaches the service to the external dokploy-network, so in Dokploy you only need to add a domain pointing at service pageindex-mcp, port 8000 (Traefik handles TLS). The container port is intentionally not published on the host - the bearer token must only travel over HTTPS.

For plain local use (no Dokploy), swap the networks section for the commented-out 127.0.0.1 port binding in docker-compose.yml.

GET /health is unauthenticated and returns ok - useful for uptime checks.
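
An uptime check against this endpoint could look like the sketch below. The function name is_healthy is hypothetical, and the body check assumes the response text contains ok; the exact response format (plain text or JSON) is not specified here.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True when GET <base_url>/health answers 200 and mentions "ok"."""
    url = base_url.rstrip("/") + "/health"
    try:
        # No Authorization header: the endpoint is unauthenticated.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and "ok" in body
    except (urllib.error.URLError, OSError):
        # Covers connection failures, timeouts and non-2xx responses.
        return False
```

Calling is_healthy("https://<your-domain>") from a cron job or monitoring script returns False on any network error, so the caller only has to handle a boolean.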

Persistence

../files/data/ (PDFs, generated trees, registry) is bind-mounted and persists across rebuilds/restarts. On Dokploy this is the app's files storage dir, which survives redeploys (the code dir does not). Back it up if you don't want to re-run ingest.
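
One way to back up the data directory is a timestamped tarball, sketched below. The function name backup_data_dir and the archive naming scheme are assumptions for illustration; any backup tool that copies the bind-mounted directory works equally well.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir: str, dest_dir: str) -> Path:
    """Write a timestamped .tar.gz of data_dir into dest_dir and return its path."""
    source = Path(data_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"data directory not found: {source}")
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"pageindex-data-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, e.g. data/trees/<doc_id>.json.
        tar.add(source, arcname=source.name)
    return archive
```

Choose a dest_dir outside the data directory so the archive does not end up inside the tree it is archiving.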
