mcp-retrieve
An MCP server for indexing and searching local text files using late-interaction retrieval (ColBERT-style MaxSim), enabling token-level relevance matching.
An MCP server that exposes late-interaction
document retrieval (ColBERT-style MaxSim) over a local folder. Point it at a
directory of text/markdown/code, and any MCP client — Claude Desktop, an IDE
agent, your own host — can index it and search it with token-level
relevance.
It ships with a deterministic, model-free default embedder, so the server and its full test suite run offline with no model weights, no API key, and no network. When you want production-grade semantics, drop in a real ColBERT / ColQwen encoder behind a small protocol; the ranking code does not change.
What is MCP?
The Model Context Protocol is an open standard that lets LLM applications
connect to external tools and data through a uniform server interface. A host
(e.g. Claude Desktop) launches MCP servers and calls the tools they
advertise. mcp-retrieve is such a server; it advertises two tools:
| Tool | Purpose |
|---|---|
| index_folder(folder) | Read text files under folder, chunk them, embed each chunk into a multi-vector representation, and build an in-memory index. |
| search(query, k=5) | Rank indexed chunks by MaxSim late interaction and return the top k with source path, score, and snippet. |
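To make the chunking step of index_folder concrete, here is a minimal sketch of fixed-size word windows with overlap. The name chunk_words and its window and overlap parameters are illustrative assumptions; the project's actual chunker in mcp_retrieve.retrieval may split text differently.

```python
def chunk_words(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word windows that share `overlap` words (hypothetical helper)."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = text.split()
    step = window - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary fully visible in at least one chunk, at the cost of indexing some tokens twice.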
The retrieval approach: late interaction (ColBERT)
Most dense retrievers compress a passage into one vector and compare it to one query vector — cheap, but lossy. Late interaction, introduced by ColBERT (Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020, arXiv:2004.12832), keeps one vector per token for both the query and the document and defers their interaction to scoring time via the MaxSim operator:
score(q, d) = Σ_i max_j sim(q_i, d_j)
Each query token q_i is matched to its single most similar document token
d_j, and those per-token maxima are summed. This preserves fine-grained term
matching (a query term can find its evidence anywhere in the passage) while
staying efficient. With L2-normalised vectors, the dot product of two token
vectors is their cosine similarity, so MaxSim reduces to a matrix product of the
query and document token matrices followed by a row-wise max and a sum, which is
exactly what mcp_retrieve.retrieval.maxsim computes.
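The formula above can be sketched in a few lines of NumPy. This is an illustration of the operator under the stated assumption of L2-normalised rows, not a copy of the project's own maxsim function; the names maxsim_score and l2_normalise are hypothetical.

```python
import numpy as np


def l2_normalise(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so that dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum, over query tokens, of the best similarity against any document token."""
    sims = query_vecs @ doc_vecs.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())


# Two query tokens against two document tokens, all rows already unit length.
query = l2_normalise(np.array([[1.0, 0.0], [0.0, 1.0]]))
doc = l2_normalise(np.array([[1.0, 0.0], [0.6, 0.8]]))
score = maxsim_score(query, doc)  # row maxima are 1.0 and 0.8
```

Ranking then amounts to computing this score for every indexed chunk and keeping the k largest.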
Install
pip install -e . # core: mcp + numpy
pip install -e ".[dev]" # plus pytest for the test suite
Register with an MCP client
For Claude Desktop, add the server to its mcpServers config
(claude_desktop_config.json):
{
  "mcpServers": {
    "mcp-retrieve": {
      "command": "mcp-retrieve"
    }
  }
}
If you installed into a virtual environment, use the absolute path to the
mcp-retrieve console script (or "command": "python", "args": ["-m", "mcp_retrieve"]). Restart the client, then ask it to index a folder and search
it — the model will call the index_folder and search tools.
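If you would rather not type the absolute path by hand, a small script can print a config entry for the current environment. This is a convenience sketch, not part of mcp-retrieve; the helper name build_server_entry is hypothetical, and it assumes you run it with the interpreter of the environment where the package is installed.

```python
import json
import shutil
import sys


def build_server_entry(script_name: str = "mcp-retrieve") -> dict:
    """Return an mcpServers entry, preferring the console script's absolute path."""
    script_path = shutil.which(script_name)  # None when the script is not on PATH
    if script_path is not None:
        return {"command": script_path}
    # Fall back to the module form, using the interpreter that runs this script.
    return {"command": sys.executable, "args": ["-m", "mcp_retrieve"]}


if __name__ == "__main__":
    config = {"mcpServers": {"mcp-retrieve": build_server_entry()}}
    print(json.dumps(config, indent=2))
```

Paste the printed JSON into claude_desktop_config.json, merging it with any servers already listed there.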
Usage from Python
from mcp_retrieve import RetrievalIndex

index = RetrievalIndex()  # deterministic default embedder
index.index_folder("./docs")
for hit in index.search("late interaction maxsim", k=5):
    print(f"{hit.score:.3f} {hit.chunk.source} {hit.snippet}")
Plugging in a real late-interaction model
The default HashingEmbedder makes the project run anywhere, but it matches on
character n-grams, not meaning. For real semantics, implement the Embedder
protocol around a trained encoder and pass it in:
import numpy as np

from mcp_retrieve import RetrievalIndex
from mcp_retrieve.server import create_server


class ColbertEmbedder:
    """Wrap a ColBERT checkpoint as a multi-vector Embedder."""

    def __init__(self, checkpoint: str) -> None:
        from colbert.modeling.checkpoint import Checkpoint
        from colbert.infra import ColBERTConfig

        self._ckpt = Checkpoint(checkpoint, ColBERTConfig())

    @property
    def dim(self) -> int:
        return 128

    def embed(self, text: str) -> "np.ndarray":
        vecs = self._ckpt.docFromText([text])[0]  # (num_tokens, 128)
        return np.asarray(vecs, dtype=np.float32)


# Use it from Python …
index = RetrievalIndex(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))

# … or run the MCP server with it.
server = create_server(embedder=ColbertEmbedder("colbert-ir/colbertv2.0"))
server.run()
Any object exposing dim: int and embed(text) -> ndarray[num_tokens, dim]
with L2-normalised rows satisfies the protocol — ColBERT, ColQwen, ColPali, or
your own. The ranking and chunking code is encoder-agnostic.
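As a rough illustration of that contract, the sketch below writes it as a typing.Protocol and pairs it with a toy embedder that hashes character trigrams. Both EmbedderLike and TrigramEmbedder are hypothetical names invented for this example; the real definitions live in mcp_retrieve.embedder, and the built-in HashingEmbedder may work differently.

```python
import hashlib
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class EmbedderLike(Protocol):
    """Assumed shape of the protocol: a dim property and an embed method."""

    @property
    def dim(self) -> int: ...

    def embed(self, text: str) -> np.ndarray: ...


class TrigramEmbedder:
    """Toy embedder: one unit vector per whitespace token, from hashed trigrams."""

    def __init__(self, dim: int = 64) -> None:
        self._dim = dim

    @property
    def dim(self) -> int:
        return self._dim

    def embed(self, text: str) -> np.ndarray:
        tokens = text.lower().split()
        out = np.zeros((len(tokens), self._dim), dtype=np.float32)
        for row, token in enumerate(tokens):
            padded = f" {token} "  # padding gives every token at least one trigram
            for i in range(len(padded) - 2):
                # hashlib rather than hash(): the built-in is salted per process.
                digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
                out[row, int.from_bytes(digest[:4], "little") % self._dim] += 1.0
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.clip(norms, 1e-12, None)  # L2-normalise each row
```

An instance of this class passes isinstance(obj, EmbedderLike), although a runtime protocol check only confirms that the members exist, not that the returned rows are actually normalised.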
Architecture
src/mcp_retrieve/
embedder.py # Embedder protocol + deterministic HashingEmbedder default
retrieval.py # chunking, MaxSim, RetrievalIndex (pure — no MCP dependency)
server.py # FastMCP server exposing index_folder + search (only MCP import)
The retrieval and embedding cores import nothing MCP-related, so they are
testable and reusable on their own; the SDK is isolated to server.py and
imported lazily.
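One way to get this kind of isolation is to import the SDK inside the function that needs it. The sketch below shows the pattern; the names load_optional and create_server_sketch are hypothetical, and the real create_server in server.py may be organised differently. It assumes the official MCP Python SDK, which provides FastMCP in mcp.server.fastmcp.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import a module on first use and fail with a readable message if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            f"{module_name!r} is required for this feature; {install_hint}"
        ) from exc


def create_server_sketch():
    """Build a FastMCP server; the SDK is only imported when this function runs."""
    fastmcp = load_optional("mcp.server.fastmcp", "install it with: pip install mcp")
    return fastmcp.FastMCP("mcp-retrieve")
```

Because nothing at module level touches the SDK, importing the retrieval code never fails on a machine without mcp installed.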
Testing
python -m pytest
All retrieval and embedder tests run offline with the default embedder. The
end-to-end FastMCP tool test is skipped automatically when the mcp package is
not installed.
License
MIT © 2026 Max Baluev