Modular RAG MCP Server

Enables building production-grade RAG systems with agentic reasoning, hybrid retrieval, and MCP integration for use with Claude Desktop.

Production-grade Agentic RAG System: ReAct Agent · Hybrid Retrieval · MCP Protocol · End-to-End Observability

A production-grade Agentic RAG framework built from scratch. It features a ReAct Agent with self-checking, Hybrid Search (Dense + BM25 + RRF), a Model Context Protocol (MCP) server compatible with Claude Desktop, and full observability via a Streamlit Dashboard.

Benchmark Results

21-query bilingual test set (Chinese + English technical docs, 70 chunks):

Retrieval Mode	Hit@1	Hit@5	MRR@10	Avg Latency
Dense Only (BGE-m3)	66.7%	100%	0.794	315 ms
Sparse Only (BM25)	90.5%	100%	0.952	14 ms
Hybrid / RRF Fusion	76.2%	100%	0.881	259 ms

All modes achieve Hit@5 = 100%. On this small test set, BM25 alone gives the best Hit@1, MRR@10, and latency. The full methodology is in EVALUATION_REPORT.md.
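Hit@K is the fraction of queries whose relevant chunk appears in the top K results, and MRR@K averages 1/rank of that chunk (counting 0 when it falls outside the top K). The sketch below assumes exactly one relevant chunk per query; the function names and toy data are hypothetical and are not taken from scripts/run_benchmark.py.

```python
from typing import Sequence


def hit_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """1.0 if the relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """1/rank of the relevant chunk within the top-k results, or 0.0 if absent."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], str]], k: int) -> dict[str, float]:
    """Average Hit@1, Hit@k and MRR@k over (ranked_ids, relevant_id) pairs."""
    n = len(runs)
    return {
        "hit@1": sum(hit_at_k(r, g, 1) for r, g in runs) / n,
        f"hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n,
        f"mrr@{k}": sum(reciprocal_rank_at_k(r, g, k) for r, g in runs) / n,
    }


# Toy runs: the relevant chunk is found at ranks 1, 2 and 4.
runs = [
    (["c1", "c7", "c3"], "c1"),
    (["c9", "c2", "c5"], "c2"),
    (["c4", "c8", "c6", "c3"], "c3"),
]
print(evaluate(runs, k=10))
```

For these toy runs, Hit@1 = 1/3, Hit@10 = 1.0, and MRR@10 = (1 + 1/2 + 1/4) / 3 ≈ 0.583.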

Architecture

┌──────────────────────────────────────────────────────────────┐
│               User / Claude Desktop / CLI                    │
└───────────────┬──────────────────────────┬───────────────────┘
                │ MCP JSON-RPC             │ Streamlit
                ▼                          ▼
    ┌───────────────────┐     ┌────────────────────────────┐
    │    MCP Server     │     │    Observability Dashboard │
    │ (stdio transport) │     │  Overview · Agent Chat ·   │
    │  query_knowledge  │     │  Ingestion · Traces · Eval │
    └────────┬──────────┘     └────────────────────────────┘
             │
             ▼
    ┌───────────────────────────────────────────────────────┐
    │                    ReAct Agent                        │
    │  ┌──────────────┐  ┌───────────┐  ┌───────────────┐   │
    │  │ Tool Registry│  │SelfChecker│  │  Conversation │   │
    │  │ 5 RAG tools  │  │(LLM judge)│  │    Memory     │   │
    │  └──────────────┘  └───────────┘  └───────────────┘   │
    └────────┬──────────────────────────────────────────────┘
             │
             ▼
    ┌───────────────────────────────────────────────────────┐
    │                    RAG Core                           │
    │  Dense Search    BM25 Search      Reranker            │
    │  (ChromaDB)  +  (jieba+rank_bm25) (Cross-Encoder)     │
    │                      │                                │
    │              RRF Fusion (k=60)                        │
    └───────────────────────────────────────────────────────┘
             │
             ▼
    ┌───────────────────────────────────────────────────────┐
    │           Pluggable Provider Layer                    │
    │  LLM: OpenAI · Azure · DeepSeek · Ollama              │
    │  Embedding: OpenAI · SiliconFlow · Ollama             │
    │  VectorStore: ChromaDB (Qdrant / Milvus planned)      │
    └───────────────────────────────────────────────────────┘

Key Features

Agentic RAG

ReAct main loop with multi-step reasoning and tool use
5 built-in tools: query_knowledge, search_by_keyword, get_document_list, calculate, get_system_status
SelfChecker: LLM-based hallucination detection and answer validation
ConversationMemory: sliding-window context for multi-turn dialogue
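The ReAct loop, tool registry, and sliding-window memory listed above can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the class and function names, the `Action: tool[input]` / `Final Answer:` text protocol, and the scripted stand-in for the LLM are assumptions for illustration, not the API of src/agent/react_agent.py, and the SelfChecker step is omitted.

```python
import re
from collections import deque
from typing import Callable


class ToolRegistry:
    """Maps tool names to callables that take and return plain strings."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, arg: str) -> str:
        if name not in self._tools:
            return f"Error: unknown tool {name!r}"
        return self._tools[name](arg)


class SlidingWindowMemory:
    """Keeps only the most recent max_turns messages as dialogue context."""

    def __init__(self, max_turns: int = 6) -> None:
        self.turns: deque = deque(maxlen=max_turns)

    def add(self, message: str) -> None:
        self.turns.append(message)

    def context(self) -> str:
        return "\n".join(self.turns)


ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react(question: str, llm: Callable[[str], str], tools: ToolRegistry,
          memory: SlidingWindowMemory, max_steps: int = 5) -> str:
    """Thought -> Action -> Observation loop until a final answer or the step limit."""
    memory.add(f"Question: {question}")
    scratchpad = ""
    for _ in range(max_steps):
        output = llm(memory.context() + "\n" + scratchpad)
        scratchpad += output + "\n"
        final = FINAL_RE.search(output)
        if final:
            answer = final.group(1).strip()
            memory.add(f"Answer: {answer}")
            return answer
        action = ACTION_RE.search(output)
        if action:
            # Run the requested tool and feed its result back as an observation.
            observation = tools.call(action.group(1), action.group(2))
            scratchpad += f"Observation: {observation}\n"
    return "Stopped: step limit reached without a final answer."


def scripted_llm(prompt: str) -> str:
    """Stand-in for a real LLM: requests one tool call, then answers."""
    if "Observation:" not in prompt:
        return "Thought: I should list the documents.\nAction: get_document_list[]"
    return "Final Answer: Two documents are indexed."


registry = ToolRegistry()
registry.register("get_document_list", lambda _arg: "rrf.md, bm25.md")
memory = SlidingWindowMemory(max_turns=6)
print(react("How many documents are indexed?", scripted_llm, registry, memory))
```

The step limit matters in practice: without it, a model that never emits a final answer would loop indefinitely.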

Hybrid Search

Dense retrieval (BGE-m3 via SiliconFlow or any OpenAI-compatible embedding)
Sparse retrieval (BM25 with jieba Chinese tokenization)
RRF (Reciprocal Rank Fusion) merges results by rank alone, so dense and BM25 scores need no normalization or weight tuning (only the constant k, default 60)
Optional Cross-Encoder reranker for precision-critical scenarios
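RRF scores each document by summing 1 / (k + rank) over every ranked list it appears in, using 1-based ranks and k = 60 as in the architecture diagram. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs; `rrf_fuse` and the sample IDs are illustrative names, not the project's code.

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60,
             top_n: Optional[int] = None) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    Ranks are 1-based; raw retriever scores are ignored entirely, so dense
    cosine similarities and BM25 scores never need to be normalized.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n] if top_n is not None else fused


dense_hits = ["chunk_a", "chunk_b", "chunk_c"]  # e.g. from the dense retriever
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]   # e.g. from BM25
for doc_id, score in rrf_fuse([dense_hits, bm25_hits]):
    print(f"{doc_id}: {score:.5f}")
```

Here `chunk_b` comes out on top (1/62 + 1/61) because it ranks near the top of both lists, even though the dense retriever alone put `chunk_a` first.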

MCP Protocol

Full JSON-RPC 2.0 over stdio transport
Plug into Claude Desktop with a one-line config addition
Exposes query_knowledge, ingest_document, list_documents as MCP tools
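Under the hood, a client such as Claude Desktop invokes these tools with JSON-RPC 2.0 `tools/call` requests, written to the server's stdin as one JSON object per line. The sketch below builds such a request and reads the text content of a response. It assumes `query_knowledge` takes a single `query` argument (the real schema is whatever the server returns from `tools/list`), and it omits the `initialize` handshake that a real client performs first.

```python
import json
from itertools import count

_request_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the message stays on one line.
    return json.dumps(message, ensure_ascii=False) + "\n"


def extract_text(response_line: str) -> str:
    """Join the text blocks of a tools/call result; raise on protocol or tool errors."""
    response = json.loads(response_line)
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown error"))
    result = response["result"]
    text = "\n".join(b["text"] for b in result.get("content", []) if b.get("type") == "text")
    if result.get("isError"):
        raise RuntimeError(text or "tool reported an error")
    return text


# "query" is an assumed argument name for query_knowledge.
request = make_tool_call("query_knowledge", {"query": "What is the RRF algorithm?"})
print(request, end="")

# Illustrative response in the shape of an MCP CallToolResult.
sample_response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "RRF fuses rankings by reciprocal rank."}],
        "isError": False,
    },
})
print(extract_text(sample_response))
```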

Full-Stack Observability

TraceContext captures per-stage latency and intermediate results for every query
Streamlit Dashboard: Overview metrics, Agent Chat, Ingestion Manager, Query Traces, Evaluation Panel
Structured logging throughout
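Per-stage tracing of this kind is commonly built on a context manager that times each stage and stores its intermediate output. The following is a hypothetical sketch; the `TraceContext` fields and the `stage()` method are illustrative and are not the project's actual class.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class TraceContext:
    """Per-query trace: one record per stage with latency and intermediate data."""
    query: str
    stages: list[dict[str, Any]] = field(default_factory=list)

    @contextmanager
    def stage(self, name: str) -> Iterator[dict[str, Any]]:
        record: dict[str, Any] = {"name": name, "data": {}}
        start = time.perf_counter()
        try:
            yield record["data"]  # the caller stores intermediate results here
        finally:
            # Recorded even if the stage raises, so failed stages still show up.
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self.stages.append(record)

    def summary(self) -> dict[str, float]:
        return {s["name"]: round(s["latency_ms"], 2) for s in self.stages}


trace = TraceContext(query="What is RRF?")
with trace.stage("dense_search") as data:
    data["hits"] = ["chunk_a", "chunk_b"]
with trace.stage("bm25_search") as data:
    data["hits"] = ["chunk_b", "chunk_d"]
with trace.stage("rrf_fusion") as data:
    data["fused"] = ["chunk_b", "chunk_a", "chunk_d"]
print(trace.summary())
```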

Evaluation Pipeline

Ragas integration + custom Hit@K / MRR@K metrics
Golden test set with 21 hand-labeled bilingual QA pairs
Reproducible benchmark scripts, with one-click runs from the Dashboard

Pluggable Architecture

6 swappable layers: LLM · Embedding · VectorStore · Reranker · Splitter · Evaluator
Switch providers by editing config/settings.yaml; no code changes required
Abstract factory pattern with dependency injection
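The config-driven abstract factory can be sketched as a registry from provider names to classes that share one abstract interface. The class names, the registry, and the `embedding.provider` / `embedding.model` keys below are assumptions standing in for the real src/libs interfaces and config/settings.yaml schema; the provider `embed` methods are placeholders, not real API calls.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseEmbedding(ABC):
    """Interface shared by all embedding providers (hypothetical)."""

    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class OllamaEmbedding(BaseEmbedding):
    def __init__(self, model: str, **_: Any) -> None:
        self.model = model

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Placeholder: a real provider would call the Ollama HTTP API here.
        return [[float(len(t))] for t in texts]


class OpenAIEmbedding(BaseEmbedding):
    def __init__(self, model: str, api_key: str = "", **_: Any) -> None:
        self.model, self.api_key = model, api_key

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Placeholder: a real provider would call the OpenAI embeddings endpoint.
        return [[0.0] for _ in texts]


EMBEDDING_REGISTRY: dict[str, type[BaseEmbedding]] = {
    "ollama": OllamaEmbedding,
    "openai": OpenAIEmbedding,
}


def create_embedding(settings: dict[str, Any]) -> BaseEmbedding:
    """Factory: pick the provider class named in config and inject its options."""
    cfg = dict(settings["embedding"])  # copy so the caller's config is untouched
    provider = cfg.pop("provider")
    try:
        cls = EMBEDDING_REGISTRY[provider]
    except KeyError:
        raise ValueError(f"Unknown embedding provider: {provider!r}") from None
    return cls(**cfg)


def index_chunks(chunks: list[str], embedder: BaseEmbedding) -> list[list[float]]:
    # Pipeline code depends only on the abstract interface (dependency injection).
    return embedder.embed(chunks)


# Stand-in for the parsed config/settings.yaml; key names are assumptions.
settings = {"embedding": {"provider": "ollama", "model": "bge-m3"}}
embedder = create_embedding(settings)
print(type(embedder).__name__, index_chunks(["hello"], embedder))
```

Swapping to another provider then means changing only the `provider` value in the config; `index_chunks` and other pipeline code stay the same.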

Tech Stack

Layer	Technology
Agent	Custom ReAct loop, SelfChecker, ConversationMemory
Retrieval	ChromaDB, rank-bm25, jieba, RRF
Reranker	sentence-transformers (Cross-Encoder)
LLM / Embedding	OpenAI / Azure / DeepSeek / Ollama / SiliconFlow
MCP	`mcp` SDK, JSON-RPC 2.0, stdio transport
Dashboard	Streamlit
Evaluation	Ragas, custom metrics
Runtime	Python 3.10+, uv
Testing	pytest (unit · integration · e2e)

Quick Start

# 1. Clone and install
git clone <repo-url>
cd modular-rag-mcp-server
pip install uv && uv sync
source .venv/bin/activate  # uv sync installs dependencies into .venv

# 2. Configure API keys: set llm.api_key and embedding.api_key
${EDITOR:-vi} config/settings.yaml

# 3. Ingest documents
python scripts/ingest.py --source path/to/your/docs

# 4. Launch Dashboard
streamlit run src/observability/dashboard/app.py

# 5. Query via CLI
python scripts/query.py "What is the RRF algorithm?"

# 6. Use as MCP Server (add to Claude Desktop config)
# {"mcpServers": {"rag": {"command": "python", "args": ["-m", "main"]}}}
python -m main

Supported LLM providers: openai · azure · deepseek · ollama
Supported Embedding providers: openai · azure · siliconflow · ollama

Project Structure

src/
├── agent/              # ReAct Agent, tool registry, memory, self-checker
│   ├── react_agent.py
│   ├── tool_registry.py
│   ├── tools/          # query, search, list, calculate, status
│   ├── memory/         # ConversationMemory
│   └── reflection/     # SelfChecker (LLM hallucination judge)
├── core/               # Config, settings, DI container
├── ingestion/          # Document parsing (PDF→MD), chunking, embedding pipeline
├── libs/               # Abstract LLM / Embedding / Reranker / Splitter
├── mcp_server/         # MCP server + tool handlers
└── observability/      # Logger, TraceContext, Streamlit Dashboard
scripts/
├── ingest.py           # Ingest documents from CLI
├── query.py            # Single-turn query from CLI
├── agent.py            # Multi-turn agent session from CLI
├── run_benchmark.py    # 4-mode retrieval benchmark
└── evaluate.py         # Ragas evaluation runner
config/
└── settings.yaml       # All configuration in one file
tests/
├── unit/               # Per-module unit tests (no external deps)
├── integration/        # Cross-module integration tests
└── e2e/                # Full pipeline end-to-end tests

Documents

Document	Description
TECHNICAL_DOC.md	Architecture deep-dive, algorithm design, key tradeoffs, interview Q&A
EVALUATION_REPORT.md	Benchmark methodology, results analysis, reproducible scripts

Design Highlights

Why RRF over weighted sum for score fusion?
RRF is rank-based, so it is unaffected by the very different score distributions of the dense and BM25 retrievers, and no score calibration is needed.

Why two-stage retrieval (coarse → fine)?
Dense and BM25 retrieval cheaply recall a broad candidate set; the Cross-Encoder reranker then scores only the top-K candidates precisely. This keeps latency manageable without sacrificing final precision.
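A sketch of the coarse-to-fine flow, assuming stage one has already produced a fused, best-first candidate list: only the first `rerank_k` candidates are sent to the expensive pairwise scorer. The `overlap_scorer` below is a toy stand-in so the example runs offline; a sentence-transformers `CrossEncoder` exposes a `predict` method that takes a list of (query, passage) pairs in the same shape. The function names and default K values are assumptions, not the project's settings.

```python
from typing import Callable, Sequence

PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def two_stage_retrieve(query: str, candidates: list[str], scorer: PairScorer,
                       rerank_k: int = 20, final_k: int = 5) -> list[tuple[str, float]]:
    """Truncate the stage-one list to rerank_k, then rerank it with the pairwise scorer."""
    shortlist = candidates[:rerank_k]
    scores = scorer([(query, passage) for passage in shortlist])
    reranked = sorted(zip(shortlist, scores), key=lambda pair: pair[1], reverse=True)
    return [(passage, float(score)) for passage, score in reranked[:final_k]]


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the passage."""
    results = []
    for query, passage in pairs:
        q_tokens = set(query.lower().split())
        p_tokens = set(passage.lower().split())
        results.append(len(q_tokens & p_tokens) / max(len(q_tokens), 1))
    return results


candidates = [
    "BM25 is a sparse lexical ranking function",
    "RRF fuses rankings from several retrievers",
    "ChromaDB stores dense embeddings",
]
top = two_stage_retrieve("how does rrf fuse rankings", candidates, overlap_scorer,
                         rerank_k=3, final_k=2)
print(top)
```

The cost of the second stage grows with `rerank_k`, not with the corpus size, which is what keeps cross-encoder latency bounded.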

Why ReAct over single-pass RAG?
Multi-step queries (comparison, multi-hop) can't be answered in one retrieval pass. ReAct lets the agent decompose the question, retrieve incrementally, and validate its own answer via SelfChecker.

License

MIT
