MCP Servers

linked-docs

Enables AI assistants to intelligently search and reference documentation using hybrid semantic + keyword search via MCP protocol.

README

Linked Documentation System

A production-ready RAG system with dual interfaces: REST API + MCP Protocol
Enables web applications and AI assistants to intelligently search and reference documentation using hybrid semantic + keyword search.

🎯 What Is This?

This is a complete RAG (Retrieval Augmented Generation) system that provides two ways to access powerful documentation search:

REST API Server (main.py) - FastAPI-based HTTP server for web applications and integrations
MCP Server (mcp_server.py) - Model Context Protocol server for AI assistants (Cursor, Claude Desktop)

Both servers share the same hybrid search engine, enabling accurate documentation retrieval whether you're building a web app or empowering an AI assistant.

Key Features

Intelligent Hybrid Search: Combines semantic understanding (FAISS embeddings) with keyword matching (BM25)
Smart Ranking: Title/metadata boosting, multi-chunk document expansion, relevance scoring
Multi-Format Support: PDF, Markdown, and web documentation (via built-in scraper)
MCP Native: Works seamlessly in Cursor, Claude Desktop, and other MCP-compatible tools
Enterprise Ready: Access control, audit logging, local-first architecture
Fast: <200ms search latency, optimized chunking and indexing
Zero Cost: Runs 100% locally, no API keys or cloud dependencies

🏗️ How It Works

┌─────────────────────────────────────────────────────────┐
│              AI Assistant (Cursor/Claude)               │
└────────────────────┬────────────────────────────────────┘
                     │ MCP Protocol (JSON-RPC over stdio)
                     ▼
┌─────────────────────────────────────────────────────────┐
│                   mcp_server.py                         │
│  ┌─────────────────────────────────────────────────┐   │
│  │  Tools: search_documentation(), list_sources()  │   │
│  └─────────────────────────────────────────────────┘   │
└────────────────────┬────────────────────────────────────┘
                     │
    ┌────────────────┼────────────────┐
    ▼                ▼                ▼
┌─────────┐   ┌──────────┐   ┌──────────────┐
│ Hybrid  │   │ Access   │   │ Audit Logger │
│ Search  │   │ Control  │   │              │
└────┬────┘   └──────────┘   └──────────────┘
     │
     ├─ Semantic Search (FAISS + embeddings)
     │   • Title/metadata boosting
     │   • Multi-chunk document expansion
     │
     └─ Keyword Search (BM25)
         • Exact term matching
         • Traditional ranking

Project Structure

LinkedDocsMCP/
├── mcp_server.py          # Main MCP server (stdio interface)
├── main.py                # FastAPI server (for testing/debugging)
├── download_docs.py       # Web documentation scraper CLI
├── connectors/            # Document format handlers
│   ├── pdf.py            # PDF extraction
│   └── markdown.py       # Markdown parsing
├── indexing/             # Search engine core
│   ├── chunker.py        # Semantic text chunking
│   ├── embedder.py       # Sentence transformers
│   ├── vector_store.py   # FAISS vector database
│   ├── keyword_search.py # BM25 implementation
│   └── hybrid_search.py  # Combined search with boosting
├── schemas/              # Data models
│   ├── config.py         # Settings & configuration
│   └── document.py       # Document schemas
├── core/                 # Cross-cutting concerns
│   ├── access_control.py # Permission system
│   └── audit.py          # Query logging
└── data/                 # Local storage
    ├── sources/          # Your documents (PDF, MD)
    └── vector_store/     # Indexed vectors & metadata

🚀 Quick Start (5 Minutes)

Prerequisites

Python 3.10+
Cursor or Claude Desktop (for MCP integration)

1. Install Dependencies

pip install -r requirements.txt

First run: Downloads ~80MB embedding model (one-time)

2. Add Documentation

Option A: Download from web

# Download Factorio wiki (example)
python download_docs.py https://wiki.factorio.com/Tutorials --crawl --max 20

Option B: Add your own files

# Copy PDFs or Markdown files
copy your-docs.pdf data/sources/
copy your-guide.md data/sources/

3. Set Up MCP in Cursor (or other LLM service)

Add to your Cursor MCP config (~/.cursor/mcp.json or C:\Users\<USER>\.cursor\mcp.json):

{
  "mcpServers": {
    "linked-docs": {
      "command": "python",
      "args": ["C:/full/path/to/LinkedDocsMCP/mcp_server.py"]
    }
  }
}

4. Restart Cursor & Use!

In Cursor's chat:

What are the different enemy types in Factorio?

The AI will automatically search your documentation and provide accurate, cited answers! ✨

Key Features Explained

Hybrid Search

Combines two complementary approaches:

Semantic Search (70% weight)

Uses sentence-transformers (all-MiniLM-L6-v2 model)
Understands meaning: "authentication setup" matches "configuring auth"
Converts text to 384-dimensional vectors
Fast similarity search with FAISS

Keyword Search (30% weight)

Uses BM25 algorithm (same as Elasticsearch)
Exact term matching: great for technical terms, code, etc.
Traditional ranking with document length normalization

Smart Ranking Enhancements:

Title Boosting: Documents whose titles match the query get 3x boost
Multi-Chunk Expansion: Returns up to 3 sequential chunks from highly relevant documents
Document Grouping: Results grouped by source document for better context

Semantic Chunking

Unlike naive character-splitting, this uses smart boundaries:

Markdown headers (##, ###) - keeps sections together
Paragraph breaks (\n\n) - maintains topical coherence
Sentences - fallback for unstructured text

Settings:

Chunk size: 2048 characters (whole sections, not fragments)
Overlap: 200 characters (prevents context loss at boundaries)

Web Documentation Scraper

Built-in tool to download and convert web docs:

# Download single page
python download_docs.py https://wiki.example.com/Guide

# Crawl multiple pages (with smart duplicate detection)
python download_docs.py https://wiki.example.com/Main --crawl --max 50

# Force re-download (skip existing detection)
python download_docs.py https://wiki.example.com/Main --crawl --force

# Filter by language
python download_docs.py https://wiki.example.com/Main --crawl --languages en,de

Features:

Auto-detects and skips existing pages (saves time & bandwidth)
Respects same-domain and link patterns
Polite crawling with configurable delays
Converts HTML to clean Markdown with metadata
Preserves document structure (headers, lists, tables)

🔒 Security & Access Control

Built-in features:

4-tier access hierarchy: PUBLIC → INTERNAL → RESTRICTED → CONFIDENTIAL
Query-time filtering based on user permissions
Full audit logging (every query tracked)
Local-only processing (no data leaves your machine)

Audit logs (data/audit.log):

{
  "timestamp": "2025-10-21T14:30:00Z",
  "event_type": "search",
  "user_id": "mcp_client",
  "query": "enemy types",
  "results_count": 5,
  "search_time_ms": 143
}

⚙️ Configuration

Edit schemas/config.py or set environment variables:

# Search weights
SEMANTIC_WEIGHT = 0.7  # Meaning-based search
KEYWORD_WEIGHT = 0.3   # Exact term matching

# Chunking
CHUNK_SIZE = 1280      # Larger chunks for complete sections
CHUNK_OVERLAP = 128    # Overlap for context continuity

# Model
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Fast, accurate, small

🔧 Advanced Usage

REST API (for testing/debugging)

# Start FastAPI server
python main.py

# Search via REST
curl -X POST http://localhost:8000/api/v1/search_docs \
  -H "Content-Type: application/json" \
  -d '{"query": "getting started", "top_k": 5}'

# Interactive API docs
open http://localhost:8000/docs

Programmatic Usage

from indexing.embedder import Embedder
from indexing.vector_store import VectorStore
from indexing.hybrid_search import HybridSearchEngine

# Initialize
embedder = Embedder(model_name="all-MiniLM-L6-v2")
vector_store = VectorStore(embedding_dim=384)
search_engine = HybridSearchEngine(vector_store, keyword_searcher, embedder)

# Search
results = search_engine.search("how to configure auth", top_k=5)
for chunk, score in results:
    print(f"{score:.3f}: {chunk.text[:100]}...")

Technical Highlights

Hybrid search outperforms pure semantic or keyword alone
Smart chunking preserves document structure
Title boosting dramatically improves ranking quality
Multi-chunk expansion provides complete context
Zero cloud dependencies - privacy-first architecture
MCP native - works with any compatible AI assistant

🙏 Acknowledgments

Built with:

FastAPI - Modern Python web framework
Sentence Transformers - Semantic embeddings
FAISS - Vector similarity search
BM25 (rank-bm25) - Keyword ranking
MCP - Model Context Protocol by Anthropic

Status: ✅ Demo Ready | Version: 1.0.0 | Updated: October 2025

Recommended Servers

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official

Featured

TypeScript

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official

Featured

Local

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official

Featured

TypeScript

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official

Featured

Python

E2B

Using MCP to run code via e2b.

Official

Featured

Neon Database

MCP server for interacting with Neon Management API and databases

Official

Featured

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official

Featured

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official

Featured