pdf-knowledge-mcp
A local RAG MCP server for PDF development experience, enabling document ingestion, semantic search, and Q&A with source citations using TF-IDF and cosine similarity.
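pdf-knowledge-mcp is a local RAG MCP server for capturing PDF development experience. It can ingest experience notes on PDF parsing, generation, rendering, text extraction, table recognition, layout analysis, font handling, OCR, PDF/A, signatures, encryption, performance optimization, and related topics, and it exposes search and Q&A over them through MCP tools.
The current implementation does not depend on remote models or an external vector database. Documents are split into chunks, retrieved with local TF-IDF vectors and cosine similarity, and the index is persisted as a JSON file. The embedding/provider can later be replaced or extended in src/knowledge-store.ts.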
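To make the retrieval step concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity over chunks. It is not taken from src/knowledge-store.ts: the names `tokenize`, `buildTfIdf`, `vectorize`, and `cosineSimilarity` are hypothetical, the tokenizer only handles ASCII letters and digits, and the smoothed IDF formula is one common choice rather than necessarily the one the server uses.

```typescript
// Hypothetical names throughout; a sketch, not the code in src/knowledge-store.ts.
type SparseVector = Map<string, number>;

// Assumption: lowercase and split on anything that is not an ASCII letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Sum the IDF weight once per occurrence, which yields tf * idf per term.
function vectorize(tokens: string[], idf: Map<string, number>): SparseVector {
  const vec: SparseVector = new Map();
  for (const term of tokens) {
    const weight = idf.get(term);
    if (weight === undefined) continue; // terms unseen in the corpus are ignored
    vec.set(term, (vec.get(term) ?? 0) + weight);
  }
  return vec;
}

// Build one vector per chunk using smoothed IDF: ln((1 + N) / (1 + df)) + 1.
function buildTfIdf(chunks: string[]): { vectors: SparseVector[]; idf: Map<string, number> } {
  const tokenized = chunks.map(tokenize);
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    new Set(tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  }
  const idf = new Map<string, number>();
  df.forEach((count, term) => {
    idf.set(term, Math.log((1 + chunks.length) / (1 + count)) + 1);
  });
  return { vectors: tokenized.map((tokens) => vectorize(tokens, idf)), idf };
}

// Cosine similarity of two sparse vectors; returns 0 when either vector is empty.
function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  });
  b.forEach((w) => {
    normB += w * w;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A query is vectorized with the same `idf` table as the chunks, so a chunk that shares no terms with the query scores 0 and is never a meaningful hit.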
Installation and build
cd C:\src\pdf-knowledge-mcp
npm install
npm run build
Start
npm start
The process serves MCP over stdio.
The default knowledge-base index path is data/pdf-knowledge-index.json inside the project. To use a different location:
$env:PDF_KNOWLEDGE_STORE_PATH = "C:\src\pdf-knowledge-mcp\data\pdf-knowledge-index.json"
npm start
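The override behaviour described above can be sketched as a small resolver. This is an illustration only: `resolveStorePath` and `DEFAULT_STORE_RELATIVE_PATH` are hypothetical names, and treating a blank value as "not set" is an assumption rather than documented behaviour.

```typescript
// Hypothetical helper; the real server may resolve the path differently.
const DEFAULT_STORE_RELATIVE_PATH = "data/pdf-knowledge-index.json";

function resolveStorePath(
  env: Record<string, string | undefined>,
  projectRoot: string,
): string {
  // Assumption: a missing or blank PDF_KNOWLEDGE_STORE_PATH means "use the default".
  const override = env["PDF_KNOWLEDGE_STORE_PATH"]?.trim();
  if (override) return override;
  // Strip trailing slashes or backslashes before joining with the default path.
  const root = projectRoot.replace(/[\\/]+$/, "");
  return `${root}/${DEFAULT_STORE_RELATIVE_PATH}`;
}
```

Passing the environment in as a parameter, instead of reading a global, keeps the function easy to exercise in a test.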
MCP configuration
It is recommended to link the package as a local command first:
cd C:\src\pdf-knowledge-mcp
npm link
Then add it to Codex:
codex mcp add pdf-knowledge -- pdf-knowledge-mcp
Example configuration for a generic MCP client:
{
  "mcpServers": {
    "pdf-knowledge": {
      "command": "node",
      "args": ["C:/src/pdf-knowledge-mcp/dist/index.js"],
      "env": {
        "PDF_KNOWLEDGE_STORE_PATH": "C:/src/pdf-knowledge-mcp/data/pdf-knowledge-index.json"
      }
    }
  }
}
Tools
ingest_document
Ingests a PDF development experience document. You can pass content directly, or pass the path to a UTF-8 text, Markdown, JSON, or HTML file.
{
  "title": "PDF text extraction notes",
  "source": "notes/text-extraction.md",
  "tags": ["parsing", "text", "font"],
  "content": "When extracting text from PDF content streams, ToUnicode CMaps are essential...",
  "chunkSize": 1800,
  "chunkOverlap": 200,
  "replaceExisting": true
}
A document can also be ingested from a file:
{
  "filePath": "C:/docs/pdf-rendering-notes.md",
  "tags": ["rendering", "performance"]
}
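The chunkSize and chunkOverlap parameters suggest sliding-window chunking. The sketch below shows one way this could work, assuming both values are measured in characters and using the example values 1800 and 200 as defaults; `chunkText` is a hypothetical name, and the actual server may split on different boundaries or count differently.

```typescript
// Hypothetical helper: fixed-size character windows that overlap by chunkOverlap.
function chunkText(text: string, chunkSize = 1800, chunkOverlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be in [0, chunkSize)");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the window has reached the end, to avoid a redundant tail chunk.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.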
search_knowledge
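Searches the local vector index for relevant experience chunks.
{
  "query": "ToUnicode font extraction",
  "limit": 5,
  "tags": ["text"]
}
The result includes the score, document title, source, tags, chunk id, matched terms, and an excerpt.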
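The limit and tags parameters can be pictured as a filter followed by a top-k cut over scored chunks. In this sketch the `ScoredChunk` shape and the `selectHits` function are hypothetical, and it assumes a chunk matches when it carries any of the requested tags (case-insensitive); the server might instead require all of them.

```typescript
// Hypothetical result shape, loosely following the fields listed above.
interface ScoredChunk {
  chunkId: string;
  title: string;
  tags: string[];
  score: number;
}

// Keep positive-scoring chunks that match at least one requested tag,
// then return the `limit` highest-scoring ones.
function selectHits(candidates: ScoredChunk[], limit: number, tags: string[] = []): ScoredChunk[] {
  const wanted = tags.map((t) => t.toLowerCase());
  return candidates
    .filter((c) => c.score > 0)
    .filter((c) => wanted.length === 0 || c.tags.some((t) => wanted.indexOf(t.toLowerCase()) !== -1))
    .sort((a, b) => b.score - a.score) // filter() returned a copy, so the input is not mutated
    .slice(0, Math.max(0, limit));
}
```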
ask_pdf_expert
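Searches the knowledge base first, then generates a source-cited answer to the PDF development question from the retrieved results.
{
  "question": "How should I handle fonts when extracting PDF text?",
  "limit": 5,
  "maxContextChars": 8000
}
If nothing matches, the tool states explicitly that relevant experience documents need to be ingested first.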
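One plausible reading of maxContextChars is a character budget on the retrieved text that feeds the answer. The sketch below assembles numbered, source-labelled excerpts until the budget would be exceeded. `Hit` and `buildContext` are hypothetical names, and stopping at the first excerpt that does not fit (rather than truncating it) is an assumption.

```typescript
// Hypothetical shape of a retrieved hit.
interface Hit {
  title: string;
  source: string;
  excerpt: string;
}

// Concatenate hits in rank order, each prefixed with a numbered citation label,
// without letting the combined context exceed maxContextChars.
function buildContext(hits: Hit[], maxContextChars: number): { context: string; citations: string[] } {
  const parts: string[] = [];
  const citations: string[] = [];
  let used = 0;
  for (const hit of hits) {
    const label = `[${citations.length + 1}] ${hit.title} (${hit.source})`;
    const block = `${label}\n${hit.excerpt}`;
    const separatorCost = parts.length > 0 ? 2 : 0; // blocks are joined with "\n\n"
    if (used + separatorCost + block.length > maxContextChars) break;
    parts.push(block);
    citations.push(label);
    used += separatorCost + block.length;
  }
  return { context: parts.join("\n\n"), citations };
}
```

An empty `citations` array corresponds to the no-match case, where the caller would report that documents need to be ingested first.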
Verification
npm test
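The smoke test verifies that:
- documents are ingested, chunked, and the index is persisted;
- local vector search and tag filtering work;
- ask_pdf_expert returns a RAG answer with sources;
- the MCP server responds to an initialize request over stdio.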
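For the last check, the message a client sends is a JSON-RPC 2.0 `initialize` request, written to the server's stdin as a single line of JSON. The sketch below builds and frames such a request and checks a response line; it is not the project's smoke test. The helper names, the client name, and the choice of protocol version "2024-11-05" are assumptions, and the server may negotiate a different version.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

// Hypothetical helper: an MCP initialize request with placeholder client details.
function buildInitializeRequest(id: number): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion: "2024-11-05", // assumption: one published MCP revision
      capabilities: {},
      clientInfo: { name: "smoke-test-client", version: "0.0.0" },
    },
  };
}

// The stdio transport is newline-delimited: one JSON message per line.
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// A response counts as successful when it echoes the id and has a result but no error.
function isSuccessfulResponse(line: string, expectedId: number): boolean {
  const parsed = JSON.parse(line) as { jsonrpc?: string; id?: number; result?: unknown; error?: unknown };
  return (
    parsed.jsonrpc === "2.0" &&
    parsed.id === expectedId &&
    parsed.result !== undefined &&
    parsed.error === undefined
  );
}
```

In a real test the framed string would be written to the stdin of the process started with `npm start`, and the first line read back from its stdout would be passed to `isSuccessfulResponse`.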
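Notes
This service is intended as a baseline for a PDF development knowledge base: it prioritizes being local, traceable, and runnable. Possible future extensions include a real embedding model, SQLite or a vector database, PDF/Docx document parsers, automatic directory sync, and joint queries with pdf-debug-mcp and pdf-specification-mcp.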