DocRAG - AI Documentation RAG System
A lightweight, installable Python package that provides RAG (Retrieval Augmented Generation) access to technical documentation through an MCP (Model Context Protocol) server. This enables LLMs to search and retrieve relevant documentation on-demand.
Features
- Single pip-installable package with CLI and MCP server
- Project-based documentation collections (BrightSign, Venafi, Qumu, web frameworks)
- Local vector database with efficient embedding using LanceDB
- Easy documentation ingestion from local files or scraped sources
- Designed for use with Claude Code via MCP
Installation
Prerequisites
- Python 3.10+
- pipx (recommended) or pip
- git (for updates)
Recommended: Install globally with pipx
# Install globally with pipx in editable mode (keeps dependencies isolated)
pipx install -e /opt/claude-ops/doc-rag
# Verify installation
docrag --help
# Optional: Install Playwright browsers (for scraping)
pipx runpip docrag install playwright
pipx run --spec docrag playwright install chromium
Note: The -e flag installs in "editable" mode, which means changes to the source code are immediately reflected without reinstalling.
Alternative: Install from source (development)
# Clone or navigate to the project directory
cd /opt/claude-ops/doc-rag
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install in development mode
pip install -e ".[dev]"
# Install Playwright browsers (for scraping)
playwright install chromium
Updating DocRAG
Option 1: Using the Update Script (Recommended)
cd /opt/claude-ops/doc-rag
./update.sh
This script will:
- Pull latest changes from git
- Detect your installation method (pipx or pip)
- Reinstall only if necessary (non-editable installs)
- Handle editable installs automatically
Option 2: Using Make
cd /opt/claude-ops/doc-rag
make update
Option 3: Manual Update
For editable installs (installed with -e):
cd /opt/claude-ops/doc-rag
git pull origin main
# No reinstall needed - changes are already active!
For regular installs (installed without -e):
cd /opt/claude-ops/doc-rag
git pull origin main
pipx uninstall docrag && pipx install .
# or for pip: pip install . --force-reinstall
Verifying Updates
# Check git status
cd /opt/claude-ops/doc-rag
git log -1 --oneline
# Test the installation
docrag --version
docrag --help
Quick Start
1. Initialize DocRAG
docrag init
This creates the configuration directory at ~/.docrag/ with the following structure:
~/.docrag/
├── config.json      # Global configuration
├── collections/     # Documentation collections
└── vectordb/        # LanceDB storage
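To confirm that initialization produced the layout above, a small check can be sketched as follows. The function name `missing_docrag_paths` is hypothetical and not part of DocRAG; it only tests for the three entries listed in the tree.

```python
from pathlib import Path

# Entries that `docrag init` is documented to create under ~/.docrag/.
EXPECTED_ENTRIES = ("config.json", "collections", "vectordb")


def missing_docrag_paths(root: Path) -> list[str]:
    """Return the expected entries that do not exist under `root`."""
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_docrag_paths(Path.home() / ".docrag")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("DocRAG layout looks complete.")
```

An empty result means all three entries are present; it says nothing about whether their contents are valid.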
2. Add a Documentation Collection
# Add documentation from a local directory
docrag add brightsign --source /path/to/brightsign/docs --description "BrightSign player documentation"
# Or add without source initially
docrag add venafi --description "Venafi TPP API documentation"
3. List Collections
docrag list
4. Search Documentation (CLI Testing)
# Search across all active collections
docrag search "how to initialize the player"
# Search a specific collection
docrag search "authentication methods" --collection venafi --limit 10
5. Start the MCP Server
docrag serve
The server communicates over stdio: Claude Code starts it as a subprocess and exchanges messages on stdin and stdout.
CLI Commands
docrag init
Initialize DocRAG configuration directory.
docrag add <name>
Add a new documentation collection.
Options:
- -s, --source PATH - Source directory containing documentation
- -d, --description TEXT - Description of the collection
Example:
docrag add qumu --source ~/docs/qumu --description "Qumu video platform docs"
docrag list
List all documentation collections with their status.
docrag update <name> <source>
Update an existing collection with new documents.
Example:
docrag update brightsign ~/docs/brightsign/updated
docrag remove <name>
Remove a documentation collection (with confirmation).
docrag search <query>
Search documentation from the CLI for testing.
Options:
- -c, --collection TEXT - Specific collection to search
- -l, --limit INTEGER - Number of results (default: 5)
Example:
docrag search "websocket connection" --collection brightsign
docrag serve
Start the MCP server for Claude Code integration.
docrag scrape <url>
Scrape documentation from websites.
Options:
- -o, --output PATH - Output directory (required)
- --smart, --use-crawl4ai - Use AI-powered Crawl4AI scraper (recommended)
- --no-llm - Disable LLM extraction (faster, still better than basic)
- --llm-provider TEXT - LLM provider (default: openai/gpt-4o-mini)
- --playwright - Use Playwright for dynamic content (basic scraper)
- --max-pages INTEGER - Maximum pages to scrape (default: 1000)
Examples:
# Basic scraping
docrag scrape https://docs.example.com --output ./docs
# Smart scraping with AI (recommended)
docrag scrape https://docs.example.com --output ./docs --smart
# Smart scraping without LLM (faster, no API key needed)
docrag scrape https://docs.example.com --output ./docs --smart --no-llm
# Limit pages
docrag scrape https://docs.example.com --output ./docs --max-pages 100
Smart Scraping Features:
- ⨠AI-powered content extraction
- šÆ Automatically removes navigation and boilerplate
- š Better handling of complex layouts
- š§ Semantic understanding of documentation structure
- ā” Faster and more accurate than basic scraping
To enable smart scraping:
# Install Crawl4AI
pipx inject docrag crawl4ai
# Optional: Set OpenAI API key for LLM-powered extraction
export OPENAI_API_KEY='your-key-here'
Using with Claude Code
1. Configure Claude Code MCP Settings
Add DocRAG to your Claude Code MCP configuration (~/.config/claude-code/mcp_settings.json or similar):
{
  "mcpServers": {
    "docrag": {
      "command": "docrag",
      "args": ["serve"],
      "env": {}
    }
  }
}
If using the full path:
{
  "mcpServers": {
    "docrag": {
      "command": "/home/claude-admin/.local/bin/docrag",
      "args": ["serve"],
      "env": {}
    }
  }
}
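If `docrag` is not on the PATH that Claude Code sees, the full-path variant is the one to use. The sketch below builds the same `mcpServers` entry and resolves the executable with `shutil.which`. The helper name `build_mcp_entry` is hypothetical, and the output still has to be merged into the settings file by hand.

```python
import json
import shutil


def build_mcp_entry(executable: str = "docrag") -> dict:
    """Build the mcpServers entry, preferring an absolute path."""
    # shutil.which returns None when the executable is not on PATH;
    # in that case fall back to the bare command name.
    command = shutil.which(executable) or executable
    return {
        "mcpServers": {
            "docrag": {"command": command, "args": ["serve"], "env": {}}
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mcp_entry(), indent=2))
```

Run it in the same shell where `docrag --help` works so that the resolved path matches the installation you verified.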
2. Restart Claude Code
After adding the configuration, restart Claude Code to load the MCP server.
3. Use in Claude Code
Once connected, Claude Code can use two tools:
search_docs: Search through indexed documentation collections
Query: "how to handle authentication in BrightSign"
Collection: (optional) "brightsign"
Limit: (optional) 5
list_collections: List all available documentation collections
Claude will automatically use these tools when working on projects that need documentation access.
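Under MCP, a tool invocation travels as a JSON-RPC 2.0 `tools/call` request. The sketch below shows roughly what a client might send for `search_docs`. The argument keys `query`, `collection`, and `limit` are assumed from the parameter labels above and may not match the server's actual schema; `build_search_request` is a hypothetical helper.

```python
import json


def build_search_request(query, collection=None, limit=None, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` payload for the search_docs tool."""
    arguments = {"query": query}
    # Optional parameters are omitted when unset so the server can apply defaults.
    if collection is not None:
        arguments["collection"] = collection
    if limit is not None:
        arguments["limit"] = limit
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search_docs", "arguments": arguments},
    }


if __name__ == "__main__":
    request = build_search_request(
        "how to handle authentication in BrightSign",
        collection="brightsign",
        limit=5,
    )
    print(json.dumps(request, indent=2))
```

In normal use Claude Code constructs these requests itself; the payload is mainly useful when debugging the server by hand.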
Architecture
Core Components
- ConfigManager (config.py) - Manages configuration and collection metadata
- EmbeddingGenerator (embeddings.py) - Generates embeddings using sentence-transformers
- VectorDB (vectordb.py) - LanceDB wrapper for vector storage and search
- DocumentIndexer (indexer.py) - Intelligent document chunking and indexing
- DocRAGServer (server.py) - MCP server implementation
- CLI (cli.py) - Command-line interface
Technical Stack
- MCP Framework: Official Anthropic MCP package
- Vector Database: LanceDB (lightweight, file-based, performant)
- Embeddings: sentence-transformers with all-MiniLM-L6-v2 model (384 dims, fast, local)
- Text Processing: langchain-text-splitters for intelligent chunking
- CLI: Click for user-friendly commands
- Web Scraping: Playwright + BeautifulSoup4 for scraping
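The retrieval step can be pictured as nearest-neighbour search over embedding vectors. The sketch below is a pure-Python illustration using cosine similarity on toy 3-dimensional vectors. It is not DocRAG's code and does not call LanceDB or sentence-transformers; whether DocRAG ranks by cosine similarity or another distance metric is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    """Return the k (text, score) pairs most similar to query_vec.

    `chunks` is a list of (text, vector) pairs.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
    index = [
        ("player initialization", [0.9, 0.1, 0.0]),
        ("authentication methods", [0.0, 1.0, 0.1]),
        ("websocket connection", [0.1, 0.0, 1.0]),
    ]
    for text, score in top_k([1.0, 0.0, 0.0], index, k=2):
        print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a handful of chunks; a vector database exists precisely to avoid scoring every stored vector on each query.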
Data Structure
~/.docrag/
├── config.json              # Global configuration
│   └── {
│         "active_collections": ["brightsign", "venafi"],
│         "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
│         "chunk_size": 512,
│         "chunk_overlap": 50
│       }
├── collections/
│   ├── brightsign/
│   │   ├── metadata.json    # Collection metadata
│   │   └── source_docs/     # Original documents
│   ├── venafi/
│   └── qumu/
└── vectordb/
    └── lancedb/             # Vector storage (one table per collection)
Configuration
Global configuration is stored in ~/.docrag/config.json:
{
  "active_collections": ["brightsign", "venafi"],
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "chunk_size": 512,
  "chunk_overlap": 50
}
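To illustrate how `chunk_size` and `chunk_overlap` interact, here is a plain character-based sliding window. This is a simplification: DocRAG is described as using langchain-text-splitters, which splits on separators rather than at fixed offsets. Whether the configured sizes count characters or tokens is not stated, so characters are assumed, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=50):
    """Split text into windows of up to chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 1200
    print([len(c) for c in chunk_text(sample)])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, provided it is shorter than the overlap.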
Collection metadata is stored in ~/.docrag/collections/<name>/metadata.json:
{
  "name": "brightsign",
  "source_type": "local",
  "source_path": "/path/to/docs",
  "created_at": "2025-10-28T10:00:00",
  "updated_at": "2025-10-28T10:00:00",
  "doc_count": 150,
  "description": "BrightSign player documentation"
}
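A hedged sketch of how such a metadata file might be refreshed after re-indexing. The `touch_metadata` helper is hypothetical; only the field names come from the example above, and timestamps are assumed to be naive ISO 8601 strings with second precision, as shown.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def touch_metadata(path: Path, doc_count: int, now: Optional[datetime] = None) -> dict:
    """Update doc_count and updated_at in a collection's metadata.json."""
    metadata = json.loads(path.read_text(encoding="utf-8"))
    metadata["doc_count"] = doc_count
    # Second precision matches the timestamp format in the example file.
    stamp = (now or datetime.now()).replace(microsecond=0)
    metadata["updated_at"] = stamp.isoformat()
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

All other fields, including `created_at`, are written back unchanged.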
Development
Project Structure
docrag/
├── docrag/
│   ├── __init__.py
│   ├── cli.py              # CLI commands
│   ├── server.py           # MCP server
│   ├── indexer.py          # Document indexing
│   ├── vectordb.py         # Vector database
│   ├── embeddings.py       # Embeddings
│   ├── config.py           # Configuration
│   └── scrapers/           # Web scrapers
│       ├── __init__.py
│       ├── base.py
│       └── generic.py
├── tests/
├── pyproject.toml
├── README.md
└── DOCRAG_MVP_BUILD_GUIDE.md
Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
Code Formatting
# Format with black
black docrag/
# Lint with ruff
ruff check docrag/
Troubleshooting
"DocRAG not initialized"
Run docrag init first to create the configuration directory.
"No collections found"
Add a collection with docrag add <name> --source <path>.
"Model download fails"
The first time you run DocRAG, it will download the sentence-transformers model (~100MB). Ensure you have internet connectivity.
"Playwright not installed"
If using scrapers, run playwright install chromium.
Future Enhancements
- [ ] Web scraper CLI commands
- [ ] Support for more file types (PDF, HTML, RST)
- [ ] Incremental indexing (only index changed files)
- [ ] Collection activation/deactivation
- [ ] Collection statistics and health checks
- [ ] Export/import collections
- [ ] Cloud sync for collections
- [ ] Advanced search filters
License
MIT
Author
Ryan - Built for homelab and Claude Code integration