PDF Knowledgebase MCP Server


A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.

🚀 Quick Start

Step 1: Install the Server

uvx pdfkb-mcp

Step 2: Configure Your MCP Client

Claude Desktop (Most Common):

Configuration file locations:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}

VS Code (Native MCP) - Create .vscode/mcp.json in workspace:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}

Step 3: Verify Installation

  1. Restart your MCP client completely
  2. Check for PDF KB tools: Look for add_document, search_documents, list_documents, remove_document
  3. Test functionality: Try adding a PDF and searching for content

🏗️ Architecture Overview

MCP Integration

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │    │   MCP Client     │    │   MCP Client    │
│ (Claude Desktop)│    │(VS Code/Continue)│    │   (Other)       │
└─────────┬───────┘    └─────────┬────────┘    └─────────┬───────┘
          │                      │                       │
          └──────────────────────┼───────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │    Model Context        │
                    │    Protocol (MCP)       │
                    │    Standard Layer       │
                    └────────────┬────────────┘
                                 │
          ┌──────────────────────┼───────────────────────┐
          │                      │                       │
┌─────────┴───────┐    ┌─────────┴────────┐    ┌─────────┴───────┐
│ PDF KB Server   │    │  Other MCP       │    │  Other MCP      │
│ (This Server)   │    │  Server          │    │  Server         │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Available Tools & Resources

Tools (Actions your client can perform):

  • add_document - Add a PDF to the knowledgebase
  • search_documents - Semantic search across indexed documents
  • list_documents - List all indexed documents
  • remove_document - Remove a document from the knowledgebase

Resources (Data your client can access):

  • pdf://{document_id} - Full document content as JSON
  • pdf://{document_id}/page/{page_number} - Specific page content
  • pdf://list - List of all documents with metadata

🎯 Parser Selection Guide

Decision Tree

Document Type & Priority?
├── 🏃 Speed Priority → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers → MinerU (fast with GPU, excellent formulas)
├── 📊 Business Reports → Docling (medium speed, best tables)
├── ⚖️ Balanced Quality → Marker (medium speed, good structure)
└── 🎯 Maximum Accuracy → LLM (slow, vision-based API calls)

Performance Comparison

| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|--------|------------------|--------|--------------|---------------|----------|
| **PyMuPDF4LLM** | **Fastest** | Low | Good | Basic | Speed priority |
| **MinerU** | Fast (with GPU) | High | Excellent | Excellent | Scientific papers |
| **Docling** | Medium | Medium | Excellent | **Excellent** | Business documents |
| **Marker** | Medium | Medium | Excellent | Good | **Balanced (default)** |
| **LLM** | Slow | Low | Excellent | Excellent | Maximum accuracy |

Benchmarks from research studies and technical reports.

⚙️ Configuration

Tier 1: Basic Configurations (80% of users)

Default (Recommended):
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "marker"
      },
      "transport": "stdio"
    }
  }
}

Speed Optimized:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}

Memory Efficient:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}

Tier 2: Use Case Specific (15% of users)

Academic Papers:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "mineru",
        "CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}

Business Documents:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE",
        "DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}

Multi-language Documents:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}

Maximum Quality:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}

Essential Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | required | OpenAI API key for embeddings |
| KNOWLEDGEBASE_PATH | ./pdfs | Directory containing PDF files |
| CACHE_DIR | ./.cache | Cache directory for processing |
| PDF_PARSER | marker | Parser: marker, pymupdf4llm, mineru, docling, llm |
| CHUNK_SIZE | 1000 | Target chunk size for LangChain chunker |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI embedding model |

🖥️ MCP Client Setup

Claude Desktop

Configuration File Location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Configuration:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}

Verification:

  1. Restart Claude Desktop completely
  2. Look for PDF KB tools in the interface
  3. Test with "Add a document" or "Search documents"

VS Code with Native MCP Support

Configuration (.vscode/mcp.json in workspace):

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}

Verification:

  1. Reload VS Code window
  2. Check VS Code's MCP server status in Command Palette
  3. Use MCP tools in Copilot Chat

VS Code with Continue Extension

Configuration (.continue/config.json):

{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}

Verification:

  1. Reload VS Code window
  2. Check Continue panel for server connection
  3. Use @pdfkb in Continue chat

Generic MCP Client

Standard Configuration Template:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "required",
        "KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDF_PARSER": "optional-default-marker"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}

📊 Performance & Troubleshooting

Common Issues

Server not appearing in MCP client:

// ❌ Wrong: Missing transport
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"]
    }
  }
}

// ✅ Correct: Include transport and restart client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}
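If the corrected configuration still does not surface the server, it can help to launch it manually in a terminal and watch the logs. A sketch, assuming `uvx` is on your PATH; the key and path are placeholders to substitute:

```shell
# Launch the server directly to confirm it starts and logs cleanly
# before wiring it into an MCP client. Key and path are placeholders.
OPENAI_API_KEY="sk-proj-..." \
KNOWLEDGEBASE_PATH="$HOME/Documents/PDFs" \
LOG_LEVEL="DEBUG" \
uvx pdfkb-mcp
```

A clean startup here narrows the problem to the client's configuration file rather than the server itself.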

Processing too slow:

// Switch to faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}

Memory issues:

// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "EMBEDDING_BATCH_SIZE": "25",
        "CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}

Poor table extraction:

// Use table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}

Resource Requirements

| Configuration | RAM Usage | Processing Speed | Best For |
|---------------|-----------|------------------|----------|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |

🔧 Advanced Configuration

Parser-Specific Options

MinerU Configuration:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "MINERU_LANG": "en",
        "MINERU_METHOD": "auto",
        "MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}

LLM Parser Configuration:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "google/gemini-2.5-flash-lite",
        "LLM_CONCURRENCY": "5",
        "LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}

Performance Tuning

High-Performance Setup:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "EMBEDDING_BATCH_SIZE": "200",
        "VECTOR_SEARCH_K": "15",
        "FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}

Intelligent Caching

The server caches intermediate results in three stages: parsing, chunking, and embedding. A configuration change invalidates only the stages that depend on it.

Cache Invalidation Rules:

  • Changing PDF_PARSER → Full reset (parsing + chunking + embeddings)
  • Changing PDF_CHUNKER → Partial reset (chunking + embeddings)
  • Changing EMBEDDING_MODEL → Minimal reset (embeddings only)
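The invalidation rules above can be modeled as a chain of stage fingerprints: each stage's cache key hashes its own settings plus the key of the stage before it, so a change early in the chain cascades downstream. A minimal sketch of the idea (function names are illustrative, not the server's internals):

```python
import hashlib

def _digest(*parts: str) -> str:
    """Short, stable hash over an ordered tuple of strings."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

def cache_fingerprints(parser: str, chunker: str, embedding_model: str) -> dict:
    """Each stage's key depends on its own config plus the previous stage's
    key, so changing PDF_PARSER invalidates everything downstream, while
    changing EMBEDDING_MODEL only invalidates the embedding stage."""
    parsing = _digest("parse", parser)
    chunking = _digest("chunk", parsing, chunker)
    embedding = _digest("embed", chunking, embedding_model)
    return {"parsing": parsing, "chunking": chunking, "embedding": embedding}
```

With this structure, swapping `PDF_PARSER` changes all three keys, swapping `PDF_CHUNKER` preserves the parsing key, and swapping `EMBEDDING_MODEL` preserves both upstream keys — matching the invalidation rules listed above.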

📚 Appendix

Installation Options

Primary (Recommended):

uvx pdfkb-mcp

With Specific Parser Dependencies:

uvx pdfkb-mcp[marker]     # Marker parser
uvx pdfkb-mcp[mineru]     # MinerU parser
uvx pdfkb-mcp[docling]    # Docling parser
uvx pdfkb-mcp[llm]        # LLM parser
uvx pdfkb-mcp[langchain]  # LangChain chunker

Development Installation:

git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"

Complete Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | required | OpenAI API key for embeddings |
| OPENROUTER_API_KEY | optional | Required for LLM parser |
| KNOWLEDGEBASE_PATH | ./pdfs | PDF directory path |
| CACHE_DIR | ./.cache | Cache directory |
| PDF_PARSER | marker | PDF parser selection |
| PDF_CHUNKER | unstructured | Chunking strategy |
| CHUNK_SIZE | 1000 | LangChain chunk size |
| CHUNK_OVERLAP | 200 | LangChain chunk overlap |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI model |
| EMBEDDING_BATCH_SIZE | 100 | Embedding batch size |
| VECTOR_SEARCH_K | 5 | Default search results |
| FILE_SCAN_INTERVAL | 60 | File monitoring interval |
| LOG_LEVEL | INFO | Logging level |

Parser Comparison Details

| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---------|-------------|--------|--------|---------|-----|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |

Chunking Strategies

LangChain (PDF_CHUNKER=langchain):

  • Header-aware splitting with MarkdownHeaderTextSplitter
  • Configurable via CHUNK_SIZE and CHUNK_OVERLAP
  • Best for customizable chunking

Unstructured (PDF_CHUNKER=unstructured):

  • Intelligent semantic chunking with unstructured library
  • Zero configuration required
  • Best for document structure awareness
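To make the header-aware strategy concrete, here is a toy, self-contained approximation of what a header-aware splitter does: split on markdown headers, then break oversized sections at paragraph boundaries. This is an illustration only, not the LangChain or unstructured implementation:

```python
import re

def split_by_headers(markdown: str, chunk_size: int = 1000) -> list[dict]:
    """Toy illustration of header-aware chunking: split a markdown document
    on headers, then split oversized sections at blank-line boundaries."""
    chunks = []
    header = ""
    body = []

    def flush():
        text = "\n".join(body).strip()
        if not text:
            return
        # Break oversized sections at paragraph (blank-line) boundaries.
        current = ""
        for para in text.split("\n\n"):
            if current and len(current) + len(para) > chunk_size:
                chunks.append({"header": header, "text": current.strip()})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        chunks.append({"header": header, "text": current.strip()})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):
            flush()                         # emit the previous section
            header = line.lstrip("# ").strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps the header it appeared under, which is what makes header-aware splitting useful for retrieval: the section context travels with the text into the embedding.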

Troubleshooting Guide

API Key Issues:

  1. Verify key format starts with sk-
  2. Check account has sufficient credits
  3. Test connectivity: curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

Parser Installation Issues:

  1. MinerU: pip install mineru[all] and verify mineru --version
  2. Docling: pip install docling for basic, pip install pdfkb-mcp[docling-complete] for all features
  3. LLM: Requires OPENROUTER_API_KEY environment variable

Performance Optimization:

  1. Speed: Use pymupdf4llm parser
  2. Memory: Reduce EMBEDDING_BATCH_SIZE and CHUNK_SIZE
  3. Quality: Use mineru (GPU) or docling (CPU)
  4. Tables: Use docling with DOCLING_TABLE_MODE=ACCURATE

For additional support, see implementation details in src/pdfkb/main.py and src/pdfkb/config.py.
