Linked Documentation System
A production-ready RAG system with dual interfaces: REST API + MCP Protocol
Enables web applications and AI assistants to intelligently search and reference documentation using hybrid semantic + keyword search.
What Is This?
This is a complete RAG (Retrieval Augmented Generation) system that provides two ways to access powerful documentation search:
- REST API Server (main.py) - FastAPI-based HTTP server for web applications and integrations
- MCP Server (mcp_server.py) - Model Context Protocol server for AI assistants (Cursor, Claude Desktop)
Both servers share the same hybrid search engine, enabling accurate documentation retrieval whether you're building a web app or empowering an AI assistant.
Key Features
- Intelligent Hybrid Search: Combines semantic understanding (FAISS embeddings) with keyword matching (BM25)
- Smart Ranking: Title/metadata boosting, multi-chunk document expansion, relevance scoring
- Multi-Format Support: PDF, Markdown, and web documentation (via built-in scraper)
- MCP Native: Works seamlessly in Cursor, Claude Desktop, and other MCP-compatible tools
- Enterprise Ready: Access control, audit logging, local-first architecture
- Fast: <200ms search latency, optimized chunking and indexing
- Zero Cost: Runs 100% locally, no API keys or cloud dependencies
How It Works
┌─────────────────────────────────────────────────────────┐
│              AI Assistant (Cursor/Claude)               │
└────────────────────┬────────────────────────────────────┘
                     │ MCP Protocol (JSON-RPC over stdio)
                     ▼
┌─────────────────────────────────────────────────────────┐
│                      mcp_server.py                      │
│  ┌───────────────────────────────────────────────────┐  │
│  │ Tools: search_documentation(), list_sources()     │  │
│  └───────────────────────────────────────────────────┘  │
└────────────────────┬────────────────────────────────────┘
                     │
     ┌───────────────┼────────────────┐
     ▼               ▼                ▼
┌─────────┐    ┌──────────┐    ┌──────────────┐
│ Hybrid  │    │ Access   │    │ Audit Logger │
│ Search  │    │ Control  │    │              │
└────┬────┘    └──────────┘    └──────────────┘
     │
     ├─ Semantic Search (FAISS + embeddings)
     │    • Title/metadata boosting
     │    • Multi-chunk document expansion
     │
     └─ Keyword Search (BM25)
          • Exact term matching
          • Traditional ranking
Project Structure
LinkedDocsMCP/
├── mcp_server.py          # Main MCP server (stdio interface)
├── main.py                # FastAPI server (for testing/debugging)
├── download_docs.py       # Web documentation scraper CLI
├── connectors/            # Document format handlers
│   ├── pdf.py             # PDF extraction
│   └── markdown.py        # Markdown parsing
├── indexing/              # Search engine core
│   ├── chunker.py         # Semantic text chunking
│   ├── embedder.py        # Sentence transformers
│   ├── vector_store.py    # FAISS vector database
│   ├── keyword_search.py  # BM25 implementation
│   └── hybrid_search.py   # Combined search with boosting
├── schemas/               # Data models
│   ├── config.py          # Settings & configuration
│   └── document.py        # Document schemas
├── core/                  # Cross-cutting concerns
│   ├── access_control.py  # Permission system
│   └── audit.py           # Query logging
└── data/                  # Local storage
    ├── sources/           # Your documents (PDF, MD)
    └── vector_store/      # Indexed vectors & metadata
Quick Start (5 Minutes)
Prerequisites
- Python 3.10+
- Cursor or Claude Desktop (for MCP integration)
1. Install Dependencies
pip install -r requirements.txt
First run: Downloads ~80MB embedding model (one-time)
2. Add Documentation
Option A: Download from web
# Download Factorio wiki (example)
python download_docs.py https://wiki.factorio.com/Tutorials --crawl --max 20
Option B: Add your own files
# Copy PDFs or Markdown files
copy your-docs.pdf data/sources/
copy your-guide.md data/sources/
3. Set Up MCP in Cursor (or other LLM service)
Add to your Cursor MCP config (~/.cursor/mcp.json or C:\Users\<USER>\.cursor\mcp.json):
{
  "mcpServers": {
    "linked-docs": {
      "command": "python",
      "args": ["C:/full/path/to/LinkedDocsMCP/mcp_server.py"]
    }
  }
}
4. Restart Cursor & Use!
In Cursor's chat:
What are the different enemy types in Factorio?
The AI will automatically search your documentation and provide accurate, cited answers!
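Behind that chat message, the assistant talks to mcp_server.py with JSON-RPC 2.0 messages over stdio. The sketch below builds the kind of `tools/call` request an MCP client might send to invoke search_documentation. The argument names (`query`, `top_k`) are assumptions, since the tool's input schema is not shown in this README.

```python
import json

def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as one JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so the message stays on one line.
    return json.dumps(message)

# Argument names are hypothetical; the real schema is reported by `tools/list`.
line = build_tool_call(1, "search_documentation", {"query": "enemy types", "top_k": 5})
print(line)
```

In normal use you never write these messages by hand; Cursor or Claude Desktop generates them once the server is registered in the MCP config.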
Key Features Explained
Hybrid Search
Combines two complementary approaches:
Semantic Search (70% weight)
- Uses sentence-transformers (all-MiniLM-L6-v2 model)
- Understands meaning: "authentication setup" matches "configuring auth"
- Converts text to 384-dimensional vectors
- Fast similarity search with FAISS
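To make the vector comparison concrete, here is a small pure-Python sketch of cosine similarity, a common way to compare sentence embeddings. The three-dimensional vectors are made up for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and whether this project uses cosine or another FAISS metric is an assumption.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values only).
query = [0.9, 0.1, 0.0]
docs = {
    "configuring auth": [0.8, 0.2, 0.1],
    "release notes": [0.0, 0.1, 0.9],
}
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

A vector index such as FAISS performs the same kind of comparison, but over many stored vectors at once instead of one pair at a time.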
Keyword Search (30% weight)
- Uses the BM25 algorithm (the default ranking function in Elasticsearch)
- Exact term matching: great for technical terms, code, etc.
- Traditional ranking with document length normalization
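As a rough illustration of how BM25 rewards exact term matches while normalizing for document length, the sketch below scores tokenized documents in plain Python. It uses the non-negative IDF form found in Lucene and the common defaults k1 = 1.5 and b = 0.75; the rank-bm25 library used by this project may differ in those details, so treat this as a teaching aid, not the project's code.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            f = term_freq[term]
            if f == 0:
                continue  # Term absent: contributes nothing.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer documents need more hits to score equally.
            denom = f + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * f * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "configure oauth token refresh".split(),
    "install the package with pip".split(),
    "oauth oauth scopes and token lifetimes explained in detail".split(),
]
scores = bm25_scores("oauth token".split(), docs)
print(scores)
```

The second document shares no terms with the query and scores zero, which is exactly the case where the semantic half of the hybrid search has to carry the result.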
Smart Ranking Enhancements:
- Title Boosting: Documents whose titles match the query get 3x boost
- Multi-Chunk Expansion: Returns up to 3 sequential chunks from highly relevant documents
- Document Grouping: Results grouped by source document for better context
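One plausible way to combine the two rankers with the weights and title boost described above is sketched here. The min-max normalization step and the "any query term appears in the title" rule are assumptions made for the example; the actual logic lives in indexing/hybrid_search.py and may differ.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two rankers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(semantic: dict[str, float], keyword: dict[str, float],
         titles: dict[str, str], query: str,
         semantic_weight: float = 0.7, keyword_weight: float = 0.3,
         title_boost: float = 3.0) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores, multiplied by a boost on title match."""
    sem, kw = min_max(semantic), min_max(keyword)
    query_terms = set(query.lower().split())
    fused = {}
    for doc in set(sem) | set(kw):
        score = semantic_weight * sem.get(doc, 0.0) + keyword_weight * kw.get(doc, 0.0)
        # Hypothetical title rule: boost when any query term appears in the title.
        if query_terms & set(titles.get(doc, "").lower().split()):
            score *= title_boost
        fused[doc] = score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Illustrative raw scores: cosine similarities and BM25 values.
semantic = {"getting-started.md": 0.82, "enemies.md": 0.60, "changelog.md": 0.40}
keyword = {"getting-started.md": 2.0, "changelog.md": 6.0}
titles = {"getting-started.md": "Getting Started", "enemies.md": "Enemy Types",
          "changelog.md": "Changelog"}
ranked = fuse(semantic, keyword, titles, "enemy types")
print([doc for doc, _ in ranked])
```

In this toy data the title match lifts enemies.md above a document with a higher raw semantic score, which is the intended effect of title boosting.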
Semantic Chunking
Unlike naive character-splitting, this uses smart boundaries:
- Markdown headers (##, ###) - keeps sections together
- Paragraph breaks (\n\n) - maintains topical coherence
- Sentences - fallback for unstructured text
Settings:
- Chunk size: 2048 characters (whole sections, not fragments)
- Overlap: 200 characters (prevents context loss at boundaries)
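The boundary preference above (headers first, then paragraphs, then sentences) can be sketched as a progressive split followed by greedy packing with overlap. This is a simplified illustration under stated assumptions, not the code in indexing/chunker.py: the function names are hypothetical, whitespace-only pieces are dropped, and a single sentence longer than the chunk size is left oversized.

```python
import re

# Zero-width split points, tried in order of preference.
BOUNDARIES = [
    r"(?m)^(?=#{2,3} )",  # before a "## " or "### " Markdown header
    r"(?<=\n\n)",         # after a blank line (paragraph break)
    r"(?<=[.!?]\s)",      # after sentence-ending punctuation plus whitespace
]

def split_units(text: str, chunk_size: int) -> list[str]:
    """Split only the pieces that are still too large, one boundary type at a time."""
    units = [text]
    for pattern in BOUNDARIES:
        next_units = []
        for unit in units:
            if len(unit) <= chunk_size:
                next_units.append(unit)
            else:
                next_units.extend(p for p in re.split(pattern, unit) if p.strip())
        units = next_units
    return units

def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 200) -> list[str]:
    """Greedily pack units into chunks, prefixing each with the previous chunk's tail."""
    chunks: list[str] = []
    carry = ""  # tail of the previous chunk, repeated for context
    body = ""   # new content gathered for the chunk being built
    for unit in split_units(text, chunk_size):
        if body and len(carry) + len(body) + len(unit) > chunk_size:
            chunk = carry + body
            chunks.append(chunk)
            carry = chunk[-overlap:] if overlap > 0 else ""
            body = ""
        body += unit
    if body:
        chunks.append(carry + body)
    return chunks

sample = "## A\n\n" + "Alpha sentence. " * 10 + "\n\n## B\n\n" + "Beta sentence. " * 10
chunks = chunk_text(sample, chunk_size=120, overlap=20)
for c in chunks:
    print(len(c), repr(c[:30]))
```

The small chunk_size in the usage lines is only there to force several chunks from a short sample; the defaults mirror the settings listed above.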
Web Documentation Scraper
Built-in tool to download and convert web docs:
# Download single page
python download_docs.py https://wiki.example.com/Guide
# Crawl multiple pages (with smart duplicate detection)
python download_docs.py https://wiki.example.com/Main --crawl --max 50
# Force re-download (skip existing detection)
python download_docs.py https://wiki.example.com/Main --crawl --force
# Filter by language
python download_docs.py https://wiki.example.com/Main --crawl --languages en,de
Features:
- Auto-detects and skips existing pages (saves time & bandwidth)
- Respects same-domain and link patterns
- Polite crawling with configurable delays
- Converts HTML to clean Markdown with metadata
- Preserves document structure (headers, lists, tables)
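The skip-existing and same-domain behavior can be pictured with a small decision function like the one below. It is a sketch under assumptions: the URL-to-filename mapping and the function names are hypothetical and are not taken from download_docs.py.

```python
import tempfile
from pathlib import Path
from urllib.parse import urljoin, urlparse

def url_to_filename(url: str) -> str:
    """Map a page URL to a Markdown filename (hypothetical naming scheme)."""
    path = urlparse(url).path.strip("/") or "index"
    return path.replace("/", "_") + ".md"

def should_fetch(link: str, base_url: str, out_dir: Path, force: bool = False) -> bool:
    """Decide whether a discovered link is worth downloading."""
    target = urljoin(base_url, link)  # Resolve relative links against the start page.
    if urlparse(target).netloc != urlparse(base_url).netloc:
        return False  # Stay on the same domain.
    if not force and (out_dir / url_to_filename(target)).exists():
        return False  # Already on disk; skipped unless forced.
    return True

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "Tutorials.md").write_text("# cached page", encoding="utf-8")
    base = "https://wiki.example.com/Main"
    results = [
        should_fetch("/Guide", base, out),                       # new page, same domain
        should_fetch("/Tutorials", base, out),                   # already downloaded
        should_fetch("/Tutorials", base, out, force=True),       # forced re-download
        should_fetch("https://other.example.org/x", base, out),  # different domain
    ]
print(results)
```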
Security & Access Control
Built-in features:
- 4-tier access hierarchy: PUBLIC → INTERNAL → RESTRICTED → CONFIDENTIAL
- Query-time filtering based on user permissions
- Full audit logging (every query tracked)
- Local-only processing (no data leaves your machine)
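Query-time filtering over an ordered hierarchy can be sketched with an IntEnum, so that "may this user see this chunk" becomes a simple comparison. The class and field names here are hypothetical, and the sketch assumes a higher clearance can read everything at or below its own tier; core/access_control.py may model this differently.

```python
from dataclasses import dataclass
from enum import IntEnum

class AccessLevel(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2
    CONFIDENTIAL = 3

@dataclass
class Hit:
    text: str
    score: float
    level: AccessLevel

def filter_by_access(hits: list[Hit], user_level: AccessLevel) -> list[Hit]:
    """Keep only hits at or below the user's clearance, preserving rank order."""
    return [h for h in hits if h.level <= user_level]

hits = [
    Hit("Public FAQ", 0.91, AccessLevel.PUBLIC),
    Hit("Internal runbook", 0.88, AccessLevel.INTERNAL),
    Hit("Salary bands", 0.75, AccessLevel.CONFIDENTIAL),
]
visible = filter_by_access(hits, AccessLevel.INTERNAL)
print([h.text for h in visible])
```

Filtering after ranking, as shown, keeps the search index shared across users; the trade-off is that a user may receive fewer than top_k results.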
Audit logs (data/audit.log):
{
  "timestamp": "2025-10-21T14:30:00Z",
  "event_type": "search",
  "user_id": "mcp_client",
  "query": "enemy types",
  "results_count": 5,
  "search_time_ms": 143
}
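A record with these fields could be produced by an append-only logger along the following lines. This is a minimal sketch, assuming one JSON object per line; the function name and the on-disk layout are assumptions, not a description of core/audit.py.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def log_search(log_path: Path, user_id: str, query: str,
               results_count: int, search_time_ms: int) -> dict:
    """Append one search event to the audit log as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event_type": "search",
        "user_id": user_id,
        "query": query,
        "results_count": results_count,
        "search_time_ms": search_time_ms,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Write two events to a throwaway log and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "audit.log"
    log_search(path, "mcp_client", "enemy types", 5, 143)
    log_search(path, "mcp_client", "oil processing", 3, 98)
    entries = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(entries), entries[0]["query"])
```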
Configuration
Edit schemas/config.py or set environment variables:
# Search weights
SEMANTIC_WEIGHT = 0.7 # Meaning-based search
KEYWORD_WEIGHT = 0.3 # Exact term matching
# Chunking
CHUNK_SIZE = 1280 # Larger chunks for complete sections
CHUNK_OVERLAP = 128 # Overlap for context continuity
# Model
EMBEDDING_MODEL = "all-MiniLM-L6-v2" # Fast, accurate, small
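If you prefer environment variables, an override layer might look like the sketch below. It assumes the variable names match the constants above and adds a sanity check on the overlap; how schemas/config.py actually reads the environment is not shown in this README, so the `Settings` class and `load_settings` function are hypothetical.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    semantic_weight: float = 0.7
    keyword_weight: float = 0.3
    chunk_size: int = 1280
    chunk_overlap: int = 128
    embedding_model: str = "all-MiniLM-L6-v2"

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from defaults, letting environment variables override them."""
    d = Settings()
    settings = Settings(
        semantic_weight=float(env.get("SEMANTIC_WEIGHT", d.semantic_weight)),
        keyword_weight=float(env.get("KEYWORD_WEIGHT", d.keyword_weight)),
        chunk_size=int(env.get("CHUNK_SIZE", d.chunk_size)),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", d.chunk_overlap)),
        embedding_model=env.get("EMBEDDING_MODEL", d.embedding_model),
    )
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings

# Pass an explicit mapping here so the example does not depend on your shell.
settings = load_settings({"CHUNK_SIZE": "2048", "CHUNK_OVERLAP": "200"})
print(settings)
```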
Advanced Usage
REST API (for testing/debugging)
# Start FastAPI server
python main.py
# Search via REST
curl -X POST http://localhost:8000/api/v1/search_docs \
-H "Content-Type: application/json" \
-d '{"query": "getting started", "top_k": 5}'
# Interactive API docs
open http://localhost:8000/docs
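The same search call can be made from Python using only the standard library. The endpoint path and payload come from the curl example above; the shape of the JSON response is not documented here, so the sketch simply returns whatever the server sends back.

```python
import json
import urllib.request

def build_search_request(query: str, top_k: int = 5,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare the POST request for the search endpoint without sending it."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/search_docs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def search(query: str, top_k: int = 5):
    """Send the request; requires the FastAPI server (python main.py) to be running."""
    with urllib.request.urlopen(build_search_request(query, top_k), timeout=10) as resp:
        return json.load(resp)

request = build_search_request("getting started")
print(request.get_method(), request.full_url)
```

Call `search("getting started")` once the server is up; it is left out of the example run so the snippet works without a live server.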
Programmatic Usage
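from indexing.embedder import Embedder
from indexing.vector_store import VectorStore
from indexing.hybrid_search import HybridSearchEngine
# Assumed class name; check indexing/keyword_search.py for the actual BM25 searcher
from indexing.keyword_search import KeywordSearcher

# Initialize
embedder = Embedder(model_name="all-MiniLM-L6-v2")
vector_store = VectorStore(embedding_dim=384)  # all-MiniLM-L6-v2 produces 384-dimensional vectors
keyword_searcher = KeywordSearcher()  # hypothetical constructor; it was undefined in the original snippet
search_engine = HybridSearchEngine(vector_store, keyword_searcher, embedder)

# Search: each result is a (chunk, score) pair
results = search_engine.search("how to configure auth", top_k=5)
for chunk, score in results:
    print(f"{score:.3f}: {chunk.text[:100]}...")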
Technical Highlights
- Hybrid search outperforms pure semantic or keyword alone
- Smart chunking preserves document structure
- Title boosting dramatically improves ranking quality
- Multi-chunk expansion provides complete context
- Zero cloud dependencies - privacy-first architecture
- MCP native - works with any compatible AI assistant
Acknowledgments
Built with:
- FastAPI - Modern Python web framework
- Sentence Transformers - Semantic embeddings
- FAISS - Vector similarity search
- BM25 (rank-bm25) - Keyword ranking
- MCP - Model Context Protocol by Anthropic
Status: Demo Ready | Version: 1.0.0 | Updated: October 2025