pdf-knowledge-mcp

A local RAG MCP server for PDF development experience, enabling document ingestion, semantic search, and Q&A with source citations using TF-IDF and cosine similarity.

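pdf-knowledge-mcp is a local RAG MCP server for capturing PDF development experience. It ingests experience documents on topics such as PDF parsing, generation, rendering, text extraction, table recognition, layout analysis, font handling, OCR, PDF/A, signatures, encryption, and performance optimization, and exposes retrieval and question answering through MCP tools.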

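The current implementation does not depend on remote models or an external vector database. Documents are split into chunks, retrieved with local TF-IDF vectors and cosine similarity, and the index is persisted as a JSON file. The embedding/provider can later be replaced or extended in src/knowledge-store.ts.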
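To make the retrieval step concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity over chunks. It is not the code in src/knowledge-store.ts: the function names, the ASCII-only tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```typescript
// Sketch only: every name here is hypothetical, not taken from src/knowledge-store.ts.
type SparseVector = { [term: string]: number };

// Assumed tokenizer: lowercase, split on anything that is not a-z or 0-9.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Sum the IDF weight once per occurrence (term frequency times IDF).
// Terms that are not in the index are ignored.
function weigh(tokens: string[], idf: SparseVector): SparseVector {
  const vec: SparseVector = Object.create(null);
  tokens.forEach((t) => {
    if (idf[t] !== undefined) vec[t] = (vec[t] || 0) + idf[t];
  });
  return vec;
}

// Build one TF-IDF vector per chunk with an assumed smoothed IDF:
// idf(t) = ln((1 + N) / (1 + df(t))) + 1
function buildTfIdf(chunks: string[]): { vectors: SparseVector[]; idf: SparseVector } {
  const tokenized = chunks.map(tokenize);
  const df: SparseVector = Object.create(null);
  tokenized.forEach((tokens) => {
    const seen: { [term: string]: boolean } = Object.create(null);
    tokens.forEach((t) => {
      if (!seen[t]) {
        seen[t] = true;
        df[t] = (df[t] || 0) + 1;
      }
    });
  });
  const idf: SparseVector = Object.create(null);
  Object.keys(df).forEach((t) => {
    idf[t] = Math.log((1 + chunks.length) / (1 + df[t])) + 1;
  });
  return { vectors: tokenized.map((tokens) => weigh(tokens, idf)), idf };
}

// Cosine similarity of two sparse vectors; 0 when either vector is empty.
function cosine(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  Object.keys(a).forEach((t) => {
    normA += a[t] * a[t];
    if (b[t] !== undefined) dot += a[t] * b[t];
  });
  Object.keys(b).forEach((t) => {
    normB += b[t] * b[t];
  });
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Usage: score two chunks against a query.
const sampleChunks = [
  "ToUnicode CMaps map glyph codes to Unicode during text extraction",
  "Linearized PDF files allow fast web view of the first page",
];
const sampleIndex = buildTfIdf(sampleChunks);
const queryVector = weigh(tokenize("ToUnicode text extraction"), sampleIndex.idf);
const sampleScores = sampleIndex.vectors.map((v) => cosine(queryVector, v));
```

In this sketch the second chunk shares no terms with the query, so its score is 0 and the first chunk ranks higher. A persisted index would need to store both the per-chunk vectors and the IDF table so that queries can be weighted the same way after a restart.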

Installation and build

cd C:\src\pdf-knowledge-mcp
npm install
npm run build

Start

npm start

The process serves MCP over stdio.

The default knowledge base index path is data/pdf-knowledge-index.json inside the project. To use a different location:

$env:PDF_KNOWLEDGE_STORE_PATH = "C:\src\pdf-knowledge-mcp\data\pdf-knowledge-index.json"
npm start

MCP configuration

Linking the package as a local command first is recommended:

cd C:\src\pdf-knowledge-mcp
npm link

Then add it to Codex:

codex mcp add pdf-knowledge -- pdf-knowledge-mcp

Example configuration for a generic MCP client:

{
  "mcpServers": {
    "pdf-knowledge": {
      "command": "node",
      "args": ["C:/src/pdf-knowledge-mcp/dist/index.js"],
      "env": {
        "PDF_KNOWLEDGE_STORE_PATH": "C:/src/pdf-knowledge-mcp/data/pdf-knowledge-index.json"
      }
    }
  }
}

Tools

ingest_document

Ingests a PDF development experience document. You can pass content directly, or pass the path to a UTF-8 text, Markdown, JSON, or HTML file.

{
  "title": "PDF text extraction notes",
  "source": "notes/text-extraction.md",
  "tags": ["parsing", "text", "font"],
  "content": "When extracting text from PDF content streams, ToUnicode CMaps are essential...",
  "chunkSize": 1800,
  "chunkOverlap": 200,
  "replaceExisting": true
}

You can also ingest from a file:

{
  "filePath": "C:/docs/pdf-rendering-notes.md",
  "tags": ["rendering", "performance"]
}
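The chunkSize and chunkOverlap parameters suggest a sliding-window split. The sketch below shows one plausible reading, assuming both values are character counts and that each chunk starts chunkSize - chunkOverlap characters after the previous one. The helper name chunkText is hypothetical, the defaults are simply the values from the example above, and the real splitter may behave differently (for example, by respecting paragraph boundaries).

```typescript
// Hypothetical helper, not the project's implementation.
// Splits text into windows of `chunkSize` characters; consecutive windows
// share `chunkOverlap` characters.
function chunkText(text: string, chunkSize: number = 1800, chunkOverlap: number = 200): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("expected chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text, so the tail
    // is not emitted again as a shorter duplicate chunk.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Usage: windows of 4 characters that overlap by 1.
const demoChunks = chunkText("abcdefghij", 4, 1);
// demoChunks is ["abcd", "defg", "ghij"]
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a slightly larger index.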

search_knowledge

Retrieves relevant experience chunks from the local vector index.

{
  "query": "ToUnicode font extraction",
  "limit": 5,
  "tags": ["text"]
}

Each result includes the score, document title, source, tags, chunk id, matched terms, and an excerpt.

ask_pdf_expert

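Searches the knowledge base first, then generates a source-cited answer to the PDF development question from the retrieved results.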

{
  "question": "How should I handle fonts when extracting PDF text?",
  "limit": 5,
  "maxContextChars": 8000
}

If nothing matches, the tool explicitly tells you to ingest relevant experience documents first.
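One way to picture how maxContextChars and source citations could interact is the sketch below: retrieved chunks are taken in descending score order and appended until the character budget is used up, and each included chunk gets a numbered citation. The RetrievedChunk shape, the buildContext name, and the citation format are assumptions for illustration; the tool's actual output format is not documented here.

```typescript
// Hypothetical shapes and names; the real tool may structure results differently.
interface RetrievedChunk {
  title: string;
  source: string;
  chunkId: string;
  score: number;
  excerpt: string;
}

// Pack the highest-scoring excerpts into at most `maxContextChars` characters
// and return a numbered citation for every excerpt that was included.
function buildContext(
  hits: RetrievedChunk[],
  maxContextChars: number
): { context: string; citations: string[] } {
  const sorted = hits.slice().sort((a, b) => b.score - a.score);
  const parts: string[] = [];
  const citations: string[] = [];
  let used = 0;
  for (let i = 0; i < sorted.length; i++) {
    const hit = sorted[i];
    const n = citations.length + 1;
    const block = `[${n}] ${hit.excerpt}`;
    const separator = parts.length > 0 ? 2 : 0; // blocks are joined with "\n\n"
    if (used + separator + block.length > maxContextChars) break;
    parts.push(block);
    citations.push(`[${n}] ${hit.title} (${hit.source}#${hit.chunkId})`);
    used += separator + block.length;
  }
  return { context: parts.join("\n\n"), citations };
}

// Usage: a budget of 10 characters only fits the best-scoring excerpt.
const packed = buildContext(
  [
    { title: "B", source: "b.md", chunkId: "c2", score: 0.2, excerpt: "bbbb" },
    { title: "A", source: "a.md", chunkId: "c1", score: 0.9, excerpt: "aaaa" },
  ],
  10
);
// packed.context is "[1] aaaa"; packed.citations is ["[1] A (a.md#c1)"]
```

An empty citations array corresponds to the no-match case, where the caller would return the prompt to ingest documents first.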

Verification

npm test

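The smoke test verifies:

  • document ingestion, chunking, and index persistence;
  • local vector retrieval and tag filtering;
  • ask_pdf_expert returns a RAG answer with sources;
  • the MCP server responds to an initialize request over stdio.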
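For the last check, the sketch below shows what an initialize exchange looks like on the wire. The MCP stdio transport carries one JSON-RPC 2.0 message per line. The protocol revision string, the client name, and the isInitializeResult helper are assumptions for illustration and are not taken from the project's test suite.

```typescript
// JSON-RPC 2.0 initialize request in the shape defined by the MCP specification.
const initializeRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: {
    protocolVersion: "2024-11-05", // assumed revision; use the one your SDK targets
    capabilities: {},
    clientInfo: { name: "smoke-test-client", version: "0.0.0" }, // hypothetical client
  },
};

// stdio transport: each message is a single line terminated by "\n".
const wireMessage = JSON.stringify(initializeRequest) + "\n";

// Hypothetical check: does a line read from the server's stdout look like
// a successful response to the initialize request with the given id?
function isInitializeResult(line: string, expectedId: number): boolean {
  const msg = JSON.parse(line);
  return (
    msg.jsonrpc === "2.0" &&
    msg.id === expectedId &&
    typeof msg.result === "object" &&
    msg.result !== null &&
    typeof msg.result.protocolVersion === "string"
  );
}
```

A test harness would write wireMessage to the stdin of the process started by npm start and pass the first line it reads from stdout to isInitializeResult.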

Notes

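This service is meant as a baseline for a PDF development experience knowledge base: its first priority is to be local, traceable, and runnable. Possible future extensions include a real embedding model, SQLite or a vector database, PDF/Docx document parsers, automatic directory sync, and federated queries with pdf-debug-mcp and pdf-specification-mcp.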
