Enterprise Knowledge MCP Server

Enables querying enterprise documents (DOCX, PDF, PPTX) using natural language, with hybrid search and MCP integration for Claude Desktop and other agents.

Unstructured Data Pipeline & Remote MCP Server — a production-ready enterprise document knowledge base. Ingests DOCX / PDF / PPTX, parses with Docling, cleans and chunks with metadata, indexes into a hybrid search store, and exposes a Remote MCP Server for Claude Desktop and other agents.

Status

Built incrementally, one step at a time (see CLAUDE.md for the full plan).

  • [x] Step 1 — Project Bootstrap: FastAPI + FastMCP + Docker + Pytest
  • [x] Step 2 — Document Upload: POST/GET /documents + persistent catalogue (DOCX/PDF/PPTX); upload now auto-runs the full pipeline and indexes immediately (no restart)
  • [x] Step 3 — Docling Parser: Docling -> structured ParsedDocument (headings/text/tables/figures, page & slide provenance)
  • [x] Step 4 — Cleaning Pipeline: strip repeated headers/footers, page numbers, empty/symbol-only noise (structure preserved)
  • [x] Step 5 — Metadata-aware Chunking: semantic chunks (section/table/figure) with full metadata; no fixed-width cuts
  • [x] Step 6 — Chroma Indexing: BGE dense embeddings into embedded persistent Chroma (index/search/get/delete)
  • [x] Step 7 — Hybrid Retrieval: dense (BGE) + sparse (BM25) fused with RRF; mixed CN/EN tokenizer
  • [x] Step 8 — MCP Tools: search_documents / list_documents / get_document / get_chunk on FastMCP
  • [x] Step 9 — MCP Resources: documents://all and documents://{document_id}
  • [x] Step 10 — Integration Test: end-to-end MCP protocol test (Client -> server -> tool/resource) + runnable client demo
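
As an illustration of the fusion named in Step 7, here is a minimal sketch of Reciprocal Rank Fusion (RRF) over a dense and a sparse ranking. The function name rrf_fuse, the constant k=60 (a commonly used default) and the chunk ids are assumptions for illustration; the project's own retriever may differ in detail.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_k]]


# Hypothetical rankings: one from dense (BGE) search, one from BM25.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fused = rrf_fuse([dense, sparse])
print(fused)
```

Because RRF works on ranks rather than raw scores, the dense similarity and the BM25 score never need to be put on a common scale.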

Architecture (target)

DOCX / PDF / PPTX
    -> Docling Parser
    -> Cleaning Pipeline
    -> Metadata-aware Chunking
    -> Hybrid Search Index (BGE dense + BM25 sparse, Chroma)
    -> Remote MCP Server (FastMCP)
    -> Claude Desktop
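
The BM25 sparse side of the index needs tokens, and whitespace splitting alone does not work for Chinese text. The sketch below shows one simple way a mixed Chinese/English tokenizer could work: lowercase ASCII word runs plus one token per CJK character. The name tokenize_mixed and the unigram strategy are assumptions; the tokenizer the project actually uses is not shown here.

```python
import re

# ASCII letter/digit runs, or a single character from the CJK Unified
# Ideographs block (U+4E00 to U+9FFF).
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize_mixed(text):
    """Split mixed Chinese/English text into BM25-ready tokens."""
    return [match.group(0).lower() for match in _TOKEN_RE.finditer(text)]


print(tokenize_mixed("Q4 良率 yield plan"))
```

Character unigrams keep the sketch dependency-free; a word segmenter or character bigrams are common alternatives when precision matters more.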

Tech Stack

Area              Choice
Language          Python 3.11
API               FastAPI
Parsing           Docling
Search            Hybrid Retrieval
Dense Retrieval   BGE Embedding
Sparse Retrieval  BM25
Vector DB         Chroma (embedded)
MCP Framework     FastMCP
Deployment        Docker
Testing           Pytest

Quick Start (local)

This repo ships a pre-created virtual environment (kb_mcp_env/, Windows).

# Install dependencies (incl. dev/test extras)
kb_mcp_env\Scripts\python.exe -m pip install -e ".[dev]"

# Run the tests
kb_mcp_env\Scripts\python.exe -m pytest -q

# Run the server
kb_mcp_env\Scripts\python.exe -m uvicorn app.main:app --reload

Run with Docker

cp .env.example .env   # optional
docker compose up --build

Brings up the app on port 8000. Chroma runs embedded in-process (no separate service); its data persists in the chroma_storage Docker volume.

Parsing & OCR

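Note: OCR runs on every image at parse time, so a document with 26 images adds tens of seconds to parsing. If a class of documents does not need OCR, set OCR_IMAGES=false in .env to turn it off.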

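Note: the default embedding model is BAAI/bge-m3 (multilingual, well suited to mixed Chinese/English text, 1024 dimensions, about 2.2 GB, downloaded on first use). If you only need English and want something lighter, set EMBEDDING_MODEL=BAAI/bge-small-en-v1.5 in .env. If the new model has a different embedding dimension, clear chroma_storage and re-index before switching.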
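
To make the two settings above concrete, here is a hedged sketch of how they could be read and how a dimension change could be detected. The names Settings, load_settings and needs_reindex are hypothetical, and the default of OCR being on is an assumption inferred from the note above; only the variable names OCR_IMAGES and EMBEDDING_MODEL and the two model ids come from this README.

```python
import os
from typing import NamedTuple

# Output dimensions of the two models named in this README.
KNOWN_DIMS = {"BAAI/bge-m3": 1024, "BAAI/bge-small-en-v1.5": 384}


class Settings(NamedTuple):
    ocr_images: bool
    embedding_model: str


def load_settings(env=None):
    """Read the two settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_ocr = env.get("OCR_IMAGES", "true")  # assumed default: OCR on
    ocr = raw_ocr.strip().lower() in {"1", "true", "yes", "on"}
    model = env.get("EMBEDDING_MODEL", "BAAI/bge-m3")
    return Settings(ocr_images=ocr, embedding_model=model)


def needs_reindex(old_model, new_model):
    """True when switching models requires clearing the vector store."""
    old_dim = KNOWN_DIMS.get(old_model)
    new_dim = KNOWN_DIMS.get(new_model)
    if old_dim is None or new_dim is None:
        # Unknown dimension: be conservative whenever the model changes.
        return old_model != new_model
    return old_dim != new_dim


settings = load_settings({"OCR_IMAGES": "false"})
print(settings, needs_reindex("BAAI/bge-m3", "BAAI/bge-small-en-v1.5"))
```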

Example Queries (target MCP tools)

What is the yield improvement plan?
Show me the KPI table from Q4 report.
Summarize slide 5.

Add a document (auto-indexed, no restart)

POST /documents saves the file and runs the full pipeline (Docling parse -> clean -> metadata-aware chunk -> BGE embed -> Chroma index) in the same process, then refreshes BM25. Because it shares the MCP server's vector-store/retriever singletons, the document is searchable over MCP immediately — no restart needed.

# server running on :8000
curl.exe -X POST http://127.0.0.1:8000/documents -F "file=@E:\path\to\report.pdf"
# -> 201 {"document_id": "...", "status": "indexed", "num_chunks": 42, ...}

The response carries status and num_chunks. On success, status is indexed. If parsing fails, the server returns HTTP 500 with status failed, and the file is still recorded. The call blocks until indexing finishes (Docling, OCR and embedding can take tens of seconds for large or image-heavy files). scripts/ingest_file.py shares the same pipeline for command-line ingestion.

Verify the MCP Server (client demo + server log)

Start the server, then drive it over the real MCP protocol with the bundled client demo:

# Terminal 1 — run the server
kb_mcp_env\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000

# Terminal 2 — connect a remote MCP client and run the example queries
kb_mcp_env\Scripts\python.exe tests\mcp_client_demo.py
# ...or pass your own query:
kb_mcp_env\Scripts\python.exe tests\mcp_client_demo.py "yield improvement plan"

The demo connects (Client -> MCP Server -> search_documents -> Result), lists the server's tools/resources, reads documents://all, and prints the retrieved chunks. Meanwhile the server console logs each invocation:

INFO:app.mcp_server:MCP tool invoked: search_documents | query='...' top_k=3
INFO:app.mcp_server:search_documents retrieved 3 chunk(s): [...]
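
For readers who want to write their own client, the flow the demo walks through can be approximated with the FastMCP client as below. This is a sketch, not the contents of tests\mcp_client_demo.py: the endpoint path /mcp is an assumption, and the argument names query and top_k are inferred from the log line above.

```python
def build_search_args(query, top_k=3):
    """Arguments for search_documents (names inferred from the server log)."""
    return {"query": query, "top_k": top_k}


async def run_demo(url, query):
    """List tools, read documents://all, then run one search over MCP."""
    # Imported lazily so the module loads even where fastmcp is not installed.
    from fastmcp import Client

    async with Client(url) as client:
        tools = await client.list_tools()
        print("tools:", [tool.name for tool in tools])

        catalogue = await client.read_resource("documents://all")
        print("documents://all parts:", len(catalogue))

        result = await client.call_tool("search_documents", build_search_args(query))
        print(result)


# Usage, with the server running (the /mcp path is an assumption):
#   import asyncio
#   asyncio.run(run_demo("http://127.0.0.1:8000/mcp", "yield improvement plan"))
```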

The hermetic equivalent (no running server, isolated temp index) is the pytest integration test:

kb_mcp_env\Scripts\python.exe -m pytest tests\test_mcp_integration.py -q

AI Workflow

This project is developed with an AI-only workflow (Claude Code + MCP). Each development step follows: plan -> implement -> review -> test, with a dedicated commit per step. See CLAUDE.md for the step-by-step record.
