# PDF Knowledgebase MCP Server
A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.
## Table of Contents
- 🚀 Quick Start
- 🏗️ Architecture Overview
- 🎯 Parser Selection Guide
- ⚙️ Configuration
- 🖥️ MCP Client Setup
- 📊 Performance & Troubleshooting
- 🔧 Advanced Configuration
- 📚 Appendix
## 🚀 Quick Start

### Step 1: Install the Server

```bash
uvx pdfkb-mcp
```
### Step 2: Configure Your MCP Client

**Claude Desktop (Most Common)**:

Configuration file locations:

- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- **Linux**: `~/.config/Claude/claude_desktop_config.json`

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs"
},
"transport": "stdio",
"autoRestart": true
}
}
}
```

**VS Code (Native MCP)**: Create `.vscode/mcp.json` in your workspace:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
},
"transport": "stdio"
}
}
}
```

### Step 3: Verify Installation

1. **Restart your MCP client** completely
2. **Check for PDF KB tools**: Look for `add_document`, `search_documents`, `list_documents`, and `remove_document`
3. **Test functionality**: Try adding a PDF and searching for content
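If the tools don't appear, you can exercise the server outside any client first. One option is the MCP Inspector (a separate debugging tool from the MCP project, assumed here to be fetched on demand via `npx`), which launches the server and lets you browse its tools in a local web UI:

```bash
# Launch pdfkb-mcp under the MCP Inspector for interactive testing.
npx @modelcontextprotocol/inspector uvx pdfkb-mcp
```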
## 🏗️ Architecture Overview

### MCP Integration

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │    │    MCP Client    │    │   MCP Client    │
│ (Claude Desktop)│    │(VS Code/Continue)│    │     (Other)     │
└────────┬────────┘    └────────┬─────────┘    └────────┬────────┘
         │                      │                       │
         └──────────────────────┼───────────────────────┘
                                │
                   ┌────────────┴────────────┐
                   │      Model Context      │
                   │     Protocol (MCP)      │
                   │     Standard Layer      │
                   └────────────┬────────────┘
                                │
         ┌──────────────────────┼───────────────────────┐
         │                      │                       │
┌────────┴────────┐    ┌────────┴─────────┐    ┌────────┴────────┐
│  PDF KB Server  │    │    Other MCP     │    │    Other MCP    │
│  (This Server)  │    │     Server       │    │     Server      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
### Available Tools & Resources

**Tools** (actions your client can perform):

- `add_document(path, metadata?)` - Add a PDF to the knowledgebase
- `search_documents(query, limit=5, metadata_filter?)` - Semantic search across PDFs
- `list_documents(metadata_filter?)` - List all documents with metadata
- `remove_document(document_id)` - Remove a document from the knowledgebase

**Resources** (data your client can access):

- `pdf://{document_id}` - Full document content as JSON
- `pdf://{document_id}/page/{page_number}` - Specific page content
- `pdf://list` - List of all documents with metadata
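Under the hood, your client issues these as JSON-RPC 2.0 requests over the configured transport; you normally never write them by hand. As an illustrative sketch, a `tools/call` request for `search_documents` looks roughly like this (argument names follow the tool signature above):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_documents",
    "arguments": {
      "query": "quarterly revenue growth",
      "limit": 5
    }
  }
}
```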
## 🎯 Parser Selection Guide

### Decision Tree

```
Document Type & Priority?
├── 🏃 Speed Priority → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers → MinerU (fast with GPU, excellent formulas)
├── 📊 Business Reports → Docling (medium speed, best tables)
├── ⚖️ Balanced Quality → Marker (medium speed, good structure)
└── 🎯 Maximum Accuracy → LLM (slow, vision-based API calls)
```
### Performance Comparison
| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|--------|------------------|--------|--------------|---------------|----------|
| **PyMuPDF4LLM** | **Fastest** | Low | Good | Basic | Speed priority |
| **MinerU** | Fast (with GPU) | High | Excellent | Excellent | Scientific papers |
| **Docling** | Medium | Medium | Excellent | **Excellent** | Business documents |
| **Marker** | Medium | Medium | Excellent | Good | **Balanced (default)** |
| **LLM** | Slow | Low | Excellent | Excellent | Maximum accuracy |
*Benchmarks are indicative, drawn from published research and technical reports; actual performance depends on hardware and document complexity.*
## ⚙️ Configuration
### Tier 1: Basic Configurations (80% of users)
**Default (Recommended)**:
```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"PDF_PARSER": "marker"
},
"transport": "stdio"
}
}
}
```

**Speed Optimized**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"PDF_PARSER": "pymupdf4llm",
"CHUNK_SIZE": "800"
},
"transport": "stdio"
}
}
}
```

**Memory Efficient**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"PDF_PARSER": "pymupdf4llm",
"EMBEDDING_BATCH_SIZE": "50"
},
"transport": "stdio"
}
}
}
```

### Tier 2: Use Case Specific (15% of users)

**Academic Papers**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"PDF_PARSER": "mineru",
"CHUNK_SIZE": "1200"
},
"transport": "stdio"
}
}
}
```

**Business Documents**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"PDF_PARSER": "docling",
"DOCLING_TABLE_MODE": "ACCURATE",
"DOCLING_DO_TABLE_STRUCTURE": "true"
},
"transport": "stdio"
}
}
}
```

**Multi-language Documents**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"PDF_PARSER": "docling",
"DOCLING_OCR_LANGUAGES": "en,fr,de,es",
"DOCLING_DO_OCR": "true"
},
"transport": "stdio"
}
}
}
```

**Maximum Quality**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
"PDF_PARSER": "llm",
"LLM_MODEL": "anthropic/claude-3.5-sonnet",
"EMBEDDING_MODEL": "text-embedding-3-large"
},
"transport": "stdio"
}
}
}
```

### Essential Environment Variables

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | *required* | OpenAI API key for embeddings |
| `KNOWLEDGEBASE_PATH` | `./pdfs` | Directory containing PDF files |
| `CACHE_DIR` | `./.cache` | Cache directory for processing |
| `PDF_PARSER` | `marker` | Parser: `marker`, `pymupdf4llm`, `mineru`, `docling`, `llm` |
| `CHUNK_SIZE` | `1000` | Target chunk size for the LangChain chunker |
| `EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI embedding model |
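The same variables can be exported in a shell to run the server standalone, which is a quick way to validate a configuration before wiring it into a client. A minimal sketch (the key and paths are placeholders):

```bash
# Run the server directly with the same settings a client would pass via "env".
export OPENAI_API_KEY="sk-proj-..."
export KNOWLEDGEBASE_PATH="$HOME/Documents/PDFs"
export PDF_PARSER="marker"
uvx pdfkb-mcp
```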
## 🖥️ MCP Client Setup

### Claude Desktop

**Configuration file location**:

- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- **Linux**: `~/.config/Claude/claude_desktop_config.json`
**Configuration**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
"CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache"
},
"transport": "stdio",
"autoRestart": true
}
}
}
```

**Verification**:
- Restart Claude Desktop completely
- Look for PDF KB tools in the interface
- Test with "Add a document" or "Search documents"
### VS Code with Native MCP Support

**Configuration** (`.vscode/mcp.json` in workspace):

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
},
"transport": "stdio"
}
}
}
```

**Verification**:
- Reload VS Code window
- Check VS Code's MCP server status in Command Palette
- Use MCP tools in Copilot Chat
### VS Code with Continue Extension

**Configuration** (`.continue/config.json`):

```json
{
"models": [...],
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
"KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
},
"transport": "stdio"
}
}
}
```

**Verification**:
- Reload VS Code window
- Check Continue panel for server connection
- Use `@pdfkb` in Continue chat
### Generic MCP Client

**Standard Configuration Template**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "required",
"KNOWLEDGEBASE_PATH": "required-absolute-path",
"PDF_PARSER": "optional-default-marker"
},
"transport": "stdio",
"autoRestart": true,
"timeout": 30000
}
}
}
```
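Because the transport is stdio, any client ultimately just spawns the command and exchanges JSON-RPC messages over stdin/stdout. For low-level debugging you can perform the first step of the MCP handshake by hand; this is a sketch (the protocol version string is an assumption and may need updating):

```bash
# Pipe a bare MCP initialize request into the server; it should answer on stdout.
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"debug","version":"0.0.1"}}}' | uvx pdfkb-mcp
```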
## 📊 Performance & Troubleshooting

### Common Issues

**Server not appearing in MCP client**:

```json
// ❌ Wrong: Missing transport
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"]
}
}
}
// ✅ Correct: Include transport and restart client
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"transport": "stdio"
}
}
}
```

**Processing too slow**:

```json
// Switch to a faster parser
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-key",
"PDF_PARSER": "pymupdf4llm"
},
"transport": "stdio"
}
}
}
```

**Memory issues**:

```json
// Reduce memory usage
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-key",
"EMBEDDING_BATCH_SIZE": "25",
"CHUNK_SIZE": "500"
},
"transport": "stdio"
}
}
}
```

**Poor table extraction**:

```json
// Use a table-optimized parser
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-key",
"PDF_PARSER": "docling",
"DOCLING_TABLE_MODE": "ACCURATE"
},
"transport": "stdio"
}
}
}
```

### Resource Requirements
| Configuration | RAM Usage | Processing Speed | Best For |
|---|---|---|---|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |
## 🔧 Advanced Configuration

### Parser-Specific Options

**MinerU Configuration**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-key",
"PDF_PARSER": "mineru",
"MINERU_LANG": "en",
"MINERU_METHOD": "auto",
"MINERU_VRAM": "16"
},
"transport": "stdio"
}
}
}
```

**LLM Parser Configuration**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-key",
"OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
"PDF_PARSER": "llm",
"LLM_MODEL": "google/gemini-2.5-flash-lite",
"LLM_CONCURRENCY": "5",
"LLM_DPI": "150"
},
"transport": "stdio"
}
}
}
```

### Performance Tuning

**High-Performance Setup**:

```json
{
"mcpServers": {
"pdfkb": {
"command": "uvx",
"args": ["pdfkb-mcp"],
"env": {
"OPENAI_API_KEY": "sk-key",
"PDF_PARSER": "mineru",
"KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
"CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
"EMBEDDING_BATCH_SIZE": "200",
"VECTOR_SEARCH_K": "15",
"FILE_SCAN_INTERVAL": "30"
},
"transport": "stdio"
}
}
}
```

### Intelligent Caching

The server uses multi-stage caching:

- **Parsing Cache**: stores converted markdown (see `src/pdfkb/intelligent_cache.py:139`)
- **Chunking Cache**: stores processed chunks
- **Vector Cache**: ChromaDB embeddings storage

**Cache Invalidation Rules**:

- Changing `PDF_PARSER` → full reset (parsing + chunking + embeddings)
- Changing `PDF_CHUNKER` → partial reset (chunking + embeddings)
- Changing `EMBEDDING_MODEL` → minimal reset (embeddings only)
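Conceptually, each stage fingerprints the configuration values it depends on, and downstream artifacts are discarded whenever an upstream fingerprint changes. A minimal Python sketch of the idea (illustrative only, not the actual `intelligent_cache.py` code):

```python
import hashlib
import json

def stage_fingerprint(config: dict, keys: list) -> str:
    """Hash only the settings that affect a given cache stage."""
    relevant = {k: config.get(k) for k in keys}
    return hashlib.sha256(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

config = {
    "PDF_PARSER": "marker",
    "PDF_CHUNKER": "unstructured",
    "EMBEDDING_MODEL": "text-embedding-3-large",
}

# Each stage depends on its own setting plus everything upstream, so changing
# PDF_PARSER changes all three fingerprints (full reset), while changing
# EMBEDDING_MODEL only changes the last one (embeddings-only reset).
parsing_fp = stage_fingerprint(config, ["PDF_PARSER"])
chunking_fp = stage_fingerprint(config, ["PDF_PARSER", "PDF_CHUNKER"])
embedding_fp = stage_fingerprint(config, ["PDF_PARSER", "PDF_CHUNKER", "EMBEDDING_MODEL"])
```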
## 📚 Appendix

### Installation Options

**Primary (Recommended)**:

```bash
uvx pdfkb-mcp
```
**With Specific Parser Dependencies** (quoted so the brackets survive shells like zsh):

```bash
uvx "pdfkb-mcp[marker]"     # Marker parser
uvx "pdfkb-mcp[mineru]"     # MinerU parser
uvx "pdfkb-mcp[docling]"    # Docling parser
uvx "pdfkb-mcp[llm]"        # LLM parser
uvx "pdfkb-mcp[langchain]"  # LangChain chunker
```
**Development Installation**:

```bash
git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"
```
### Complete Environment Variables Reference

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | *required* | OpenAI API key for embeddings |
| `OPENROUTER_API_KEY` | *optional* | Required for the LLM parser |
| `KNOWLEDGEBASE_PATH` | `./pdfs` | PDF directory path |
| `CACHE_DIR` | `./.cache` | Cache directory |
| `PDF_PARSER` | `marker` | PDF parser selection |
| `PDF_CHUNKER` | `unstructured` | Chunking strategy |
| `CHUNK_SIZE` | `1000` | LangChain chunk size |
| `CHUNK_OVERLAP` | `200` | LangChain chunk overlap |
| `EMBEDDING_MODEL` | `text-embedding-3-large` | OpenAI model |
| `EMBEDDING_BATCH_SIZE` | `100` | Embedding batch size |
| `VECTOR_SEARCH_K` | `5` | Default number of search results |
| `FILE_SCAN_INTERVAL` | `60` | File monitoring interval (seconds) |
| `LOG_LEVEL` | `INFO` | Logging level |
### Parser Comparison Details
| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---|---|---|---|---|---|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |
### Chunking Strategies

**LangChain** (`PDF_CHUNKER=langchain`):

- Header-aware splitting with `MarkdownHeaderTextSplitter`
- Configurable via `CHUNK_SIZE` and `CHUNK_OVERLAP`
- Best for customizable chunking (see the sketch below)
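A minimal sketch of this strategy with the `langchain-text-splitters` package (sizes mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` defaults; this illustrates the approach, not the server's exact code):

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

with open("parsed_document.md") as f:
    markdown = f.read()

# First pass: split on markdown headers so chunks respect document structure.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = header_splitter.split_text(markdown)

# Second pass: enforce the target chunk size within each section.
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = size_splitter.split_documents(sections)
```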
**Unstructured** (`PDF_CHUNKER=unstructured`):

- Intelligent semantic chunking with the `unstructured` library
- Zero configuration required
- Best for document structure awareness
### Troubleshooting Guide

**API Key Issues**:

- Verify the key format starts with `sk-`
- Check that the account has sufficient credits
- Test connectivity:

```bash
curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models
```
**Parser Installation Issues**:

- **MinerU**: `pip install mineru[all]`, then verify with `mineru --version`
- **Docling**: `pip install docling` for basic features, `pip install pdfkb-mcp[docling-complete]` for all features
- **LLM**: requires the `OPENROUTER_API_KEY` environment variable
**Performance Optimization**:

- **Speed**: use the `pymupdf4llm` parser
- **Memory**: reduce `EMBEDDING_BATCH_SIZE` and `CHUNK_SIZE`
- **Quality**: use `mineru` (GPU) or `docling` (CPU)
- **Tables**: use `docling` with `DOCLING_TABLE_MODE=ACCURATE`
For additional support, see the implementation in `src/pdfkb/main.py` and `src/pdfkb/config.py`.