vectorise-mcp

Local MCP server that turns folders of documents into a hybrid vector + keyword index that Claude Desktop can search. Stays offline after first model download.

Install

pip install vectorise-mcp                 # core
pip install "vectorise-mcp[ocr]"          # + OCR for scanned PDFs / images
pip install "vectorise-mcp[notify]"       # + desktop toast on job completion
pip install "vectorise-mcp[ocr,notify]"   # everything

vectorise-mcp setup                       # pre-download models (~250MB)

Python ≥ 3.10.

Wire into Claude Desktop

claude_desktop_config.json:

{
  "mcpServers": {
    "vectorise": {
      "command": "vectorise-mcp",
      "args": ["serve"]
    }
  }
}

Config file location:

  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
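
If you would rather script the edit than do it by hand, the following is a minimal sketch that merges the entry above into an existing config without touching other servers. The helper names (claude_config_path, add_vectorise_entry, update_config_file) are hypothetical and not part of vectorise-mcp; the paths are the ones listed above.

```python
import json
import os
import sys
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def claude_config_path(platform: str = sys.platform) -> Path:
    """Return the Claude Desktop config path listed above for a platform."""
    if platform.startswith("win"):
        return Path(os.environ.get("APPDATA", "")) / "Claude" / CONFIG_NAME
    if platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    return Path.home() / ".config" / "Claude" / CONFIG_NAME


def add_vectorise_entry(config: dict) -> dict:
    """Return a copy of config with the vectorise server entry merged in."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["vectorise"] = {"command": "vectorise-mcp", "args": ["serve"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the file if it exists, merge the entry, and write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_vectorise_entry(existing), indent=2), encoding="utf-8")
```

Calling update_config_file(claude_config_path()) applies the change for the current platform. Existing mcpServers entries are preserved because the merge only sets the "vectorise" key.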

Restart Claude Desktop.

File support

Format                                    Notes
.pdf                                      text + OCR fallback for scanned pages
.docx, .pptx, .xlsx, .xlsm, .xls          full content + tables
.txt, .md, .markdown                      UTF-8
.png, .jpg, .jpeg, .tiff, .bmp, .webp     OCR (requires [ocr])
.doc, .ppt                                detected, skipped, reported

Tools exposed to Claude

Tool                                              What it does
vectorise_list_projects                           list all indexed projects
vectorise_index_project(folder, project, mode)    start indexing job, returns job_id instantly
vectorise_reindex_project(project)                SHA1-incremental rescan of all sources
vectorise_index_status(job_id)                    instant job snapshot incl. progress + ETA
vectorise_await_index(job_id, timeout_sec)        optional blocking wait
vectorise_list_jobs(active_only)                  jobs from current server session
vectorise_search(project, query, k, candidate_pool, file_glob, subdirectory, page_min, page_max, min_similarity)
                                                  hybrid + reranked search
vectorise_delete_project(project)                 delete project's .db

mode for vectorise_index_project: auto (default; incremental if the path is already indexed, error on conflict), replace, append, or fail.
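
Claude Desktop builds tool calls itself, but it can help to see what one looks like on the wire. The sketch below constructs an MCP tools/call JSON-RPC request for vectorise_index_project. The argument names come from the table above; the helper name, the example folder, and the client-side mode check are illustrative assumptions, not part of the server.

```python
import json

# Modes listed in this README for vectorise_index_project.
VALID_MODES = {"auto", "replace", "append", "fail"}


def index_project_request(folder: str, project: str, mode: str = "auto", request_id: int = 1) -> dict:
    """Build an MCP tools/call request for vectorise_index_project."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "vectorise_index_project",
            "arguments": {"folder": folder, "project": project, "mode": mode},
        },
    }


if __name__ == "__main__":
    # Hypothetical folder and project name, for illustration only.
    print(json.dumps(index_project_request("/data/contracts", "contracts"), indent=2))
```

The job_id returned by such a call would then be passed to vectorise_index_status or vectorise_await_index in the same way.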

Architecture

An indexing job runs in a daemon thread with its own asyncio loop, so the MCP server's main loop stays free to serve index_status / search calls regardless of how heavy the embedding/OCR work is. Status calls are instant; search works on the partial index while a job is running.
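
The threading pattern described above can be sketched as follows. This is not the project's actual code: JobState, start_job, and the sleep that stands in for embedding/OCR work are all hypothetical. It only shows a daemon thread running its own event loop while the caller reads progress snapshots without blocking.

```python
import asyncio
import threading


class JobState:
    """Thread-safe progress counter shared between the job and its callers."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.done = 0
        self.finished = threading.Event()
        self._lock = threading.Lock()

    def advance(self) -> None:
        with self._lock:
            self.done += 1

    def snapshot(self) -> dict:
        # Cheap read: never waits on the indexing work itself.
        with self._lock:
            return {"done": self.done, "total": self.total, "finished": self.finished.is_set()}


async def _index(state: JobState) -> None:
    for _ in range(state.total):
        await asyncio.sleep(0.01)  # stand-in for parsing, OCR, and embedding
        state.advance()


def start_job(total: int) -> JobState:
    """Start a job in a daemon thread with its own event loop; return at once."""
    state = JobState(total)

    def runner() -> None:
        try:
            asyncio.run(_index(state))  # creates a fresh loop in this thread
        finally:
            state.finished.set()

    threading.Thread(target=runner, daemon=True).start()
    return state
```

Because asyncio.run creates a new loop inside the worker thread, the loop that serves requests is never blocked, and snapshot() returns immediately whether or not the job has finished.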

folder
  ↓  parsers.parse                        (.pdf .docx .pptx .xlsx ...)
chunks (sentence-aware, 384 tok / 96 overlap, single-sentence hard-split)
  ↓  embedder.embed_passages              (BGE-small)
sqlite-vec   +   FTS5 (BM25)              ← per-file SHA1 dedup, basename collision auto-rename
  ↓  search                               (vector top-N + BM25 top-N)
RRF fusion → cross-encoder rerank → top-K
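
The RRF fusion step in the diagram combines the vector and BM25 result lists by rank rather than by raw score. A minimal sketch of reciprocal rank fusion is shown below. The function name and the constant k=60 (the value conventionally used for RRF) are assumptions; this README does not state which constant vectorise-mcp uses.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Optional, Sequence


def rrf_fuse(
    rankings: Iterable[Sequence[Hashable]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> List[Hashable]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # hypothetical chunk ids from vector search
    bm25_hits = ["b", "d", "a"]    # hypothetical chunk ids from BM25
    print(rrf_fuse([vector_hits, bm25_hits]))
```

Items that appear near the top of both lists rise above items that rank highly in only one, which is why the fused list is a reasonable candidate pool to hand to the cross-encoder reranker.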

Project DBs live in ~/.vectorise-mcp/<name>.db. Self-contained — source folder can be deleted after indexing.

Config (env vars)

Var                                 Default                   Purpose
VECTORISE_MCP_EMBED_MODEL           BAAI/bge-small-en-v1.5    must be 384-dim
VECTORISE_MCP_RERANKER_MODEL        BAAI/bge-reranker-base
VECTORISE_MCP_EMBED_BATCH           32
VECTORISE_MCP_RERANKER_BATCH        16
VECTORISE_MCP_OCR_MIN_CONFIDENCE    0.5                       drop OCR lines below
VECTORISE_MCP_OCR_WORKERS           4                         parallel page OCR threads
VECTORISE_MCP_OCR_DPI               200                       PDF rasterisation DPI
VECTORISE_MCP_OCR_MAX_DIM           4000                      downscale huge images before OCR
VECTORISE_MCP_NOTIFY                1                         desktop toast on/off
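
To show how these variables and defaults fit together, here is a sketch of reading them into a typed settings object. The Settings class and load_settings function are hypothetical, and treating any VECTORISE_MCP_NOTIFY value other than "0" as "on" is an assumption; the server's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "VECTORISE_MCP_"


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table above.
    embed_model: str = "BAAI/bge-small-en-v1.5"
    reranker_model: str = "BAAI/bge-reranker-base"
    embed_batch: int = 32
    reranker_batch: int = 16
    ocr_min_confidence: float = 0.5
    ocr_workers: int = 4
    ocr_dpi: int = 200
    ocr_max_dim: int = 4000
    notify: bool = True


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = Settings()

    def get(name, cast, default):
        raw = env.get(PREFIX + name)
        return default if raw is None else cast(raw)

    return Settings(
        embed_model=get("EMBED_MODEL", str, d.embed_model),
        reranker_model=get("RERANKER_MODEL", str, d.reranker_model),
        embed_batch=get("EMBED_BATCH", int, d.embed_batch),
        reranker_batch=get("RERANKER_BATCH", int, d.reranker_batch),
        ocr_min_confidence=get("OCR_MIN_CONFIDENCE", float, d.ocr_min_confidence),
        ocr_workers=get("OCR_WORKERS", int, d.ocr_workers),
        ocr_dpi=get("OCR_DPI", int, d.ocr_dpi),
        ocr_max_dim=get("OCR_MAX_DIM", int, d.ocr_max_dim),
        notify=get("NOTIFY", lambda v: v.strip() != "0", d.notify),  # assumed rule
    )
```

When the server is launched by Claude Desktop, variables like these are usually supplied through an "env" object on the server entry in claude_desktop_config.json rather than the shell.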

Performance

                                      CPU                      GPU
Indexing throughput                   ~80 chunks/sec           5–10× faster
Search latency (k=5, ≤500K chunks)    ~150 ms                  similar
Disk per chunk                        ~2 KB
Cold start                            ~5 s (lazy model load)

Local dev

git clone https://github.com/jameslovespancakes/Vectorised-Embedding-MCP
cd Vectorised-Embedding-MCP
pip install -e ".[ocr,notify]"

# tests bypass MCP transport, drive indexer + tools directly
python tests/smoke_test.py
python tests/smoke_test_projects.py
python tests/smoke_test_jobs.py
python tests/smoke_test_filters.py
python tests/smoke_test_office.py
python tests/smoke_test_chunking.py
python tests/smoke_test_legacy_skip.py

License

MIT.
