Documentation Search MCP Server

A Model Context Protocol (MCP) server that provides semantic search over documentation sites. Index any documentation by URL, and search it from Claude Code, Cursor, or any MCP-compatible client.

Features

🔍 Semantic Search: OpenAI embeddings for intelligent documentation search
🌐 Auto-Discovery: Automatically finds and parses sitemaps
📦 Local Storage: ChromaDB for persistent, local vector storage
🎨 Simple GUI: Gradio interface for managing indexed sites
🔄 Easy Reindexing: Update documentation with one click
🚀 MCP Compatible: Works with Claude Code, Cursor, and other MCP clients

Installation

Prerequisites

Python 3.10 or higher
OpenAI API key

Setup

Clone or navigate to the project directory:

cd docs-mcp-server

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Configure OpenAI API key:

Create a .env file in the project root:

cp .env.example .env

Edit .env and add your OpenAI API key:

OPENAI_API_KEY=sk-...

Usage

1. Launch the GUI to Index Documentation

Start the Gradio interface:

python -m src.gui

This will open a web interface at http://127.0.0.1:7860 where you can:

Add documentation sites by URL
View indexed sites and statistics
Reindex existing sites
Delete sites

Example: Indexing LangGraph docs

Go to the "Add Documentation Site" tab
Enter base URL: https://langchain-ai.github.io/langgraph/
Leave sitemap URL empty (auto-discovery)
Click "Index Site"

The indexer will:

Find the sitemap automatically
Crawl all pages
Convert HTML to Markdown
Generate embeddings
Store in local ChromaDB
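The sitemap step can be pictured with a small sketch. The code in indexer.py is not reproduced in this README, so the helper name extract_sitemap_urls is hypothetical; it uses only the standard library and assumes the sitemap follows the sitemaps.org schema.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value in a sitemap or sitemap index.

    Hypothetical helper for illustration; not taken from indexer.py.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Tiny in-memory sitemap so the sketch runs without network access.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/</loc></url>
  <url><loc>https://docs.example.com/guide/</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```

In a sitemap index, the <loc> entries point at further sitemaps rather than pages, so a real crawler would need to fetch those in turn.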

2. Configure MCP Server

For Claude Code

Add to your Claude Code MCP settings (~/.config/claude-code/mcp.json or via Claude Code settings):

{
  "mcpServers": {
    "docs-search": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/absolute/path/to/docs-mcp-server",
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

For Cursor

Add to Cursor MCP settings:

{
  "mcpServers": {
    "docs-search": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/absolute/path/to/docs-mcp-server",
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
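Either configuration above can be generated instead of typed by hand, which avoids mistakes in the absolute path. The sketch below is one way to do that; build_mcp_config is a hypothetical helper, and using sys.executable for "command" is an assumption (when run from the activated virtual environment it points at the interpreter that has the dependencies installed).

```python
import json
import sys
from pathlib import Path


def build_mcp_config(project_dir: str, api_key: str = "sk-...") -> dict:
    """Build the "mcpServers" entry for a given checkout (illustrative helper)."""
    return {
        "mcpServers": {
            "docs-search": {
                # Assumption: the interpreter running this script is the one
                # the MCP client should launch (for example, the venv's Python).
                "command": sys.executable,
                "args": ["-m", "src.server"],
                # Resolve to an absolute path, as the config requires.
                "cwd": str(Path(project_dir).resolve()),
                "env": {"OPENAI_API_KEY": api_key},
            }
        }
    }


# Run from the project root and paste the output into the client's MCP settings.
print(json.dumps(build_mcp_config("."), indent=2))
```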

3. Use the Search Tool

Once configured, you can use the search_docs tool in your MCP client:

Example queries:

"How do I create a state graph in LangGraph?"
"What are the different types of nodes in LangGraph?"
"Show me examples of conditional edges"

The tool will return relevant documentation chunks with:

Source URL
Similarity score
Page content

Project Structure

docs-mcp-server/
├── src/
│   ├── __init__.py        # Package initialization
│   ├── server.py          # MCP server implementation
│   ├── indexer.py         # Documentation crawler and indexer
│   ├── embedder.py        # OpenAI embedding generation
│   ├── db.py              # ChromaDB wrapper
│   └── gui.py             # Gradio management interface
├── data/
│   ├── chroma/            # ChromaDB storage (auto-created)
│   └── config.json        # Indexed sites configuration
├── requirements.txt       # Python dependencies
├── .env.example           # Environment variables template
└── README.md              # This file

How It Works

Indexing Pipeline:

- Discovers sitemap from base URL
- Fetches all pages from sitemap
- Converts HTML to clean Markdown
- Splits content into overlapping chunks
- Generates embeddings using OpenAI
- Stores in ChromaDB with metadata

Search Process:

- User query is embedded using OpenAI
- ChromaDB performs cosine similarity search
- Top results are returned with metadata
- Results include source URL and similarity score
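To make the search step concrete, here is a minimal sketch of cosine-similarity ranking over a handful of toy vectors. It is not how ChromaDB is implemented (ChromaDB uses an approximate nearest-neighbor index), and the names cosine_similarity and top_k, along with the two-dimensional embeddings, are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Rank chunks by similarity to the query and keep the best k."""
    scored = [(cosine_similarity(query, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"url": c["url"], "score": s} for s, c in scored[:k]]


# Toy data: real embeddings have hundreds or thousands of dimensions.
chunks = [
    {"url": "https://docs.example.com/a", "embedding": [1.0, 0.0]},
    {"url": "https://docs.example.com/b", "embedding": [0.0, 1.0]},
    {"url": "https://docs.example.com/c", "embedding": [1.0, 1.0]},
]

for hit in top_k([1.0, 0.0], chunks, k=2):
    print(hit["url"], round(hit["score"], 3))
```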

Configuration Options

Indexing Parameters

When adding a site via GUI or code:

base_url: Main documentation URL (required)
sitemap_url: Custom sitemap URL (optional, auto-discovered if not provided)
max_pages: Limit number of pages to index (optional, useful for testing)

Chunking

Default chunk settings in indexer.py:

chunk_size: 1000 characters
overlap: 200 characters

These can be adjusted in the chunk_text() method for your specific needs.
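For reference, overlapping character-based chunking with these defaults can be sketched as follows. This is an assumption about the general technique, not a copy of the chunk_text() method in indexer.py, which may split on different boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share
    `overlap` characters with the previous window (illustrative sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))
print([len(c) for c in chunk_text(text)])  # [1000, 1000, 900]
```

A larger overlap reduces the chance that a sentence is cut off from its context, at the price of more chunks and therefore more embedding calls.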

Embedding Model

Default: text-embedding-3-small (OpenAI)

To use a different model, modify embedder.py, then reindex your sites, since vectors produced by different models are not comparable:

self.model = "text-embedding-3-large"  # More accurate but more expensive

Troubleshooting

"No documentation has been indexed yet"

Run the GUI and add at least one documentation site before using the search tool.

"Could not find sitemap.xml"

Some sites don't have a sitemap. Try providing the sitemap URL manually or ensure the site has a publicly accessible sitemap.
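When looking for a sitemap by hand, two common places to check are the Sitemap: lines in the site's robots.txt and /sitemap.xml under the base URL. The sketch below lists those candidates; candidate_sitemap_urls is a hypothetical helper, and whether the server's auto-discovery checks the same locations is an assumption.

```python
from urllib.parse import urljoin


def candidate_sitemap_urls(base_url: str, robots_txt: str = "") -> list[str]:
    """List sitemap URLs worth trying for a site (illustrative helper)."""
    candidates = []
    # robots.txt may advertise sitemaps with lines like "Sitemap: <url>".
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    # Fall back to the conventional location under the base URL.
    base = base_url if base_url.endswith("/") else base_url + "/"
    default = urljoin(base, "sitemap.xml")
    if default not in candidates:
        candidates.append(default)
    return candidates


robots = "User-agent: *\nSitemap: https://docs.example.com/sitemap_index.xml\n"
print(candidate_sitemap_urls("https://docs.example.com", robots))
```

Any URL from this list that returns valid XML can be pasted into the sitemap URL field in the GUI.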

"OpenAI API key not found"

Make sure your .env file exists and contains a valid OPENAI_API_KEY.
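A quick way to rule out a malformed .env file is to parse it yourself. The sketch below is a standalone check using only the standard library; how the server actually loads .env is not shown in this README, and read_env_file and key_status are hypothetical names.

```python
def read_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def key_status(env_text: str) -> str:
    """Classify the OPENAI_API_KEY entry as missing, placeholder, or set."""
    key = read_env_file(env_text).get("OPENAI_API_KEY", "")
    if not key:
        return "missing"
    if key == "sk-...":
        return "placeholder"  # the template value was never replaced
    return "set"


print(key_status("# copied from .env.example\nOPENAI_API_KEY=sk-...\n"))
```

To check a real file, pass its contents, for example key_status(open(".env").read()).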

ChromaDB errors

Delete the data/chroma/ directory to reset the database:

rm -rf data/chroma/

Then reindex your sites.

Cost Estimation

OpenAI Embedding Costs (text-embedding-3-small):

~$0.02 per 1M tokens
Average documentation site (500 pages): ~$0.10-0.50
Search queries: well under $0.0001 per query (at this rate, $0.0001 covers about 5,000 tokens)
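The embedding figures above follow from simple arithmetic on the per-token price. The sketch below reproduces it; the token counts are hypothetical inputs, and the price is the text-embedding-3-small figure quoted above, which may change.

```python
# USD per 1M tokens, as quoted above for text-embedding-3-small.
PRICE_PER_MILLION_TOKENS = 0.02


def embedding_cost(num_tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost in USD of embedding num_tokens tokens."""
    return num_tokens / 1_000_000 * price_per_million


# Hypothetical site: 500 pages averaging 10,000 tokens each (5M tokens total).
print(f"site:  ${embedding_cost(500 * 10_000):.2f}")
# Hypothetical search query of about 20 tokens.
print(f"query: ${embedding_cost(20):.7f}")
```

By the same formula, the quoted $0.10-0.50 range corresponds to roughly 5 to 25 million tokens for the whole site.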

Storage:

ChromaDB is stored locally (no cloud costs)
Average site: 50-200 MB

Advanced Usage

Programmatic Indexing

You can index sites programmatically:

from src.embedder import Embedder
from src.db import DocsDatabase
from src.indexer import DocumentIndexer

embedder = Embedder(api_key="sk-...")
database = DocsDatabase()
indexer = DocumentIndexer(embedder, database)

result = indexer.index_site(
    base_url="https://docs.example.com",
    max_pages=100  # Optional limit
)

print(f"Indexed {result['pages_indexed']} pages")

Custom Search

from src.embedder import Embedder
from src.db import DocsDatabase

embedder = Embedder()  # no api_key passed; presumably falls back to OPENAI_API_KEY from the environment
database = DocsDatabase()

# Search
query_embedding = embedder.embed_text("your query")
results = database.search_all_collections(query_embedding, n_results=10)

for result in results:
    print(f"URL: {result['metadata']['url']}")
    print(f"Content: {result['document'][:200]}...")

Roadmap

[ ] Support for custom embedding models (local transformers)
[ ] Incremental updates (detect changed pages)
[ ] Better HTML parsing for specific doc frameworks
[ ] Export/import indexed data
[ ] REST API for search
[ ] Support for PDF documentation

Contributing

Contributions welcome! Some ideas:

Add support for more documentation formats
Improve HTML to Markdown conversion
Add more embedding providers
Enhance the GUI

License

MIT License - feel free to use and modify!

Credits

Built with:

MCP - Model Context Protocol
ChromaDB - Vector database
OpenAI - Embeddings
Gradio - GUI framework
BeautifulSoup - HTML parsing
