rag-paper

A local-first paper RAG server that enables searching and managing academic PDFs via MCP tools, supporting metadata enrichment and citation graphs.

Local-first paper RAG for PDF research workflows.

rag-paper helps you build a private, local, searchable paper library from PDF files. It parses PDFs, chunks paper text, creates embeddings, stores vectors in local Chroma, and exposes retrieval through a CLI and MCP server for tools such as Codex CLI and Claude Code.

It is designed for researchers, students, engineers, and LLM power users who want to read papers efficiently, save tokens, and work with affordable models such as DeepSeek or Qwen, with embeddings from local Ollama or an OpenAI-compatible embedding service.

Chinese README

Keywords

paper RAG, local RAG, PDF RAG, academic search, Chroma, MCP server, Codex CLI, Claude Code, Zotero, Obsidian, citation graph, Mermaid, DOI enrichment, CrossRef, OpenAlex, semantic deduplication, local vector database, research assistant, low-cost LLM workflow

Why rag-paper

Token-efficient paper reading: retrieve only relevant chunks instead of sending whole PDFs to an LLM.
Local-first storage: vectors and metadata are stored locally in Chroma.
MCP ready: expose paper search tools to Codex CLI, Claude Code, and other MCP clients.
Low-cost model friendly: use local Ollama embeddings by default, or an OpenAI-compatible embedding endpoint.
Zotero friendly: point root_path to one or more Zotero storage/export directories.
Obsidian friendly: export citation graphs as Mermaid Markdown.
Privacy controls: use skip_marker_file to prevent sensitive folders from being indexed.

Requirements

Python 3.10+
A local or remote embedding provider
Default embedding setup: Ollama with qwen3-embedding:4b

ollama pull qwen3-embedding:4b

Installation

git clone https://github.com/your-name/rag-paper.git
cd rag-paper
python -m venv .venv
source .venv/bin/activate
pip install -e .

Windows PowerShell:

git clone https://github.com/your-name/rag-paper.git
Set-Location rag-paper
py -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e .

Quick Start

Create a config file:

rag-paper init-config --path ./config.json

Put PDFs under ./papers, or edit config.json and point papers[].root_path to your own directories.

Index PDFs:

rag-paper index

Search locally:

rag-paper search "retrieval augmented generation evaluation" --top-k 5

Inspect indexed papers:

rag-paper list-indexed-papers
rag-paper show-indexed-paper "attention" --limit 3

Delete a bad indexed record, then rebuild the citation graph so its exports no longer reference it:

rag-paper delete-indexed-paper "10.55277/researchhub.x9vnpm0y.1"
rag-paper build-citation-graph

Start the MCP server:

rag-paper serve

Tech Stack

PyMuPDF: PDF text extraction and PDF metadata reading.
ChromaDB: local persistent vector database for paper chunks.
Rank BM25: keyword retrieval path for hybrid search.
MCP Python SDK: exposes paper search and inspection tools to Codex CLI, Claude Code, and other MCP clients.

Modules

Indexing

rag-paper index scans configured PDF roots, respects skip_marker_file, extracts text, chunks papers, creates embeddings, and writes chunks to Chroma. Before vectorization, it shows the number of files and the root paths they came from and waits for confirmation, unless --yes or indexing.assume_yes is enabled.

Indexing is incremental. rag-paper stores size + mtime_ns in the manifest and only computes SHA256 when the quick file signature changes. Failed files are recorded in a JSONL log and can be retried with rag-paper index --retry-failed.

Search and Retrieval

rag-paper search performs local hybrid retrieval over indexed chunks. It combines vector similarity from Chroma with BM25 keyword matching, then returns compact source-aware excerpts that are suitable for sending to an LLM instead of whole PDFs.

Search supports filters such as author, year, tag, and file name. The same retrieval engine powers the MCP tools.
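
The README does not say how the vector and BM25 results are merged. As a minimal sketch, assuming reciprocal rank fusion (one common way to combine ranked lists, not necessarily what rag-paper does), the idea looks like this. The function name, the constant k, and the chunk IDs are all hypothetical:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical IDs ordered by vector similarity
bm25_hits = ["c1", "c9", "c3"]    # hypothetical IDs ordered by BM25 score
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists rise to the top, which is the behavior a hybrid search is usually after: a chunk that matches both semantically and by keyword outranks one that matches only one way.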

Indexed Paper Inspection

rag-paper list-indexed-papers shows the current Chroma library as a terminal table, including total paper count, chunk count, title, DOI, year, and source path.

rag-paper show-indexed-paper supports fuzzy selectors over title, file name, source path, and DOI. It shows merged metadata and the first 5 chunk IDs by default; use --all-chunks to show every chunk ID.

rag-paper delete-indexed-paper deletes matching papers from Chroma and the index manifest. Without --yes, every matched paper must be confirmed one by one before deletion. After deleting indexed papers, run rag-paper build-citation-graph to refresh citation graph exports.

Metadata Enrichment

rag-paper enrich-metadata enriches indexed papers with DOI and bibliographic metadata. It supports CrossRef and OpenAlex, provider fallback, rate limiting, custom User-Agent, contact email, and HTTP/HTTPS/SOCKS5 proxies.

The enrichment module is decoupled from vectorization. It can run per indexed file, after indexing finishes, or manually. Results are written to paper_metadata.json and provider responses are cached in SQLite.

Title quality checks reject obvious spam, URLs, ad-like strings, and symbol-heavy PDF metadata titles. When a PDF title is not trusted, rag-paper prefers a title inferred from the first page, a provider title, or the file name.
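
To make the title check concrete, here is a rough sketch of such a heuristic. The thresholds, the regular expression, and the function names are assumptions for illustration; the actual rules in rag-paper may differ:

```python
import re

URL_PATTERN = re.compile(r"(https?://|www\.)", re.IGNORECASE)


def looks_trustworthy(title: str, max_symbol_ratio: float = 0.3) -> bool:
    """Heuristic title check; the thresholds are illustrative assumptions."""
    text = title.strip()
    if len(text) < 5:
        return False  # too short to be a paper title
    if URL_PATTERN.search(text):
        return False  # URLs are a common sign of junk PDF metadata
    # Count characters that are neither letters, digits, nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def pick_title(pdf_title: str, first_page_title: str, provider_title: str, file_name: str) -> str:
    """Use the first candidate that passes the check, else fall back to the file name."""
    for candidate in (pdf_title, first_page_title, provider_title):
        if candidate and looks_trustworthy(candidate):
            return candidate.strip()
    return file_name
```

For example, pick_title("http://spam.example", "Attention Is All You Need", "", "paper.pdf") skips the URL-like metadata title and returns the first-page title.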

Deduplication

rag-paper dedupe-papers reports duplicate candidates before indexing. It can compare DOI/title-year metadata and optional semantic signatures built from paper text or abstracts. Depending on configuration, duplicates can be reported or skipped.
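
The DOI/title-year comparison can be pictured as grouping papers by a normalized key. The sketch below is only an illustration under assumed field names (doi, title, year); it leaves out the semantic signatures and uses an ASCII-only title normalization for brevity:

```python
import re
from collections import defaultdict
from typing import Optional


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def dedupe_key(paper: dict) -> Optional[str]:
    """Prefer the DOI; otherwise fall back to a normalized title plus year."""
    if paper.get("doi"):
        return "doi:" + normalize_doi(paper["doi"])
    # ASCII-only normalization: a simplification for this sketch.
    title = re.sub(r"[^a-z0-9]+", " ", (paper.get("title") or "").lower()).strip()
    if not title:
        return None  # nothing reliable to compare on
    return f"title:{title}|{paper.get('year', '')}"


def find_duplicates(papers: list[dict]) -> list[list[dict]]:
    """Return groups of two or more papers that share a key."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for paper in papers:
        key = dedupe_key(paper)
        if key is not None:
            groups[key].append(paper)
    return [group for group in groups.values() if len(group) > 1]
```

Papers without a DOI or a usable title are never grouped, so missing metadata cannot produce false duplicate reports.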

Citation Graph

rag-paper build-citation-graph builds a graph from enriched DOI/OpenAlex metadata and exports JSON plus Mermaid Markdown. The Mermaid output is designed to work well in Obsidian for browsing paper relationships.

MCP Server

rag-paper serve starts an MCP server so external tools can query the local paper library. MCP clients can import papers, search chunks, list indexed papers, inspect metadata, delete indexed records, enrich metadata, deduplicate papers, fetch specific chunks, export context, and build citation graphs.

Typical Workflows

Use with Zotero

root_path is an array, so you can point rag-paper at multiple paper folders, including Zotero storage/export folders:

{
  "papers": [
    {
      "root_path": [
        "D:/Zotero/storage",
        "D:/Zotero/exports/LLM"
      ],
      "skip_marker_file": ".rag-paper-skip",
      "tags": ["zotero"]
    }
  ]
}

rag-paper recursively scans these roots and indexes only .pdf files.

Protect private folders with `skip_marker_file`

If a directory contains the configured marker file, rag-paper skips that directory and all of its children.

This is useful for privacy protection. For example, you can place .rag-paper-skip in folders containing unpublished papers, private notes, or papers that should not be exposed through MCP search.

{
  "papers": [
    {
      "root_path": ["./papers"],
      "skip_marker_file": ".rag-paper-skip"
    }
  ]
}

When a marker is detected, rag-paper highlights the warning and asks whether to continue unless --yes or indexing.assume_yes is enabled.
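
The skip behavior described above can be sketched with os.walk, which allows pruning a directory tree in place. This is an illustration, not rag-paper's actual scanner; the function name find_pdfs is hypothetical:

```python
import os
from pathlib import Path


def find_pdfs(root: Path, skip_marker_file: str = ".rag-paper-skip") -> list[Path]:
    """Collect PDFs under root, pruning any directory that holds the marker file."""
    pdfs: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        if skip_marker_file in filenames:
            dirnames[:] = []  # clearing in place stops os.walk from descending further
            continue          # nothing in the marked directory itself is collected
        pdfs.extend(
            Path(dirpath) / name for name in filenames if name.lower().endswith(".pdf")
        )
    return sorted(pdfs)
```

Because the marked directory is pruned before its children are visited, PDFs in nested subfolders are never read or even listed.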

Back up and restore work

By default, core runtime data is stored under:

rag_paper_data/
  chroma_db/
  paper_metadata.json
  cache/
  citation_graph/
  logs/

Copying rag_paper_data/ is enough to back up the local Chroma vectors, metadata, cache, failure logs, retrieval stats, and citation graph exports when default paths are used.

To restore work on another device:

Install rag-paper.
Copy rag_paper_data/ into the project directory.
Copy your config.json if you customized paths.
Run rag-paper list-indexed-papers to verify the restored index.
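
If you prefer a single archive file over copying the folder, the backup and restore steps can be scripted. The helper below is a sketch that assumes the default rag_paper_data/ layout; the function names and the archive name are hypothetical, and it is safest to run it while no index or serve process is writing to the directory:

```python
import shutil
from pathlib import Path


def backup_data_dir(data_dir: Path, dest_dir: Path, name: str = "rag_paper_backup") -> Path:
    """Zip data_dir (including its top-level folder name) into dest_dir/name.zip.

    dest_dir should live outside data_dir so the archive does not include itself.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data directory not found: {data_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(
        str(dest_dir / name), "zip", root_dir=data_dir.parent, base_dir=data_dir.name
    )
    return Path(archive)


def restore_data_dir(archive: Path, project_dir: Path) -> None:
    """Unpack the archive so the data folder reappears under project_dir."""
    project_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(project_dir))
```

After restoring, rag-paper list-indexed-papers remains the quickest way to confirm that the index is readable.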

Build a citation graph for Obsidian

After metadata enrichment, build a citation graph:

rag-paper build-citation-graph

rag-paper exports:

JSON graph: rag_paper_data/citation_graph/citation_graph.json
Mermaid Markdown: rag_paper_data/citation_graph/citation_graph.md

The Mermaid file can be opened directly in Obsidian or any Markdown tool with Mermaid support.
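
For readers unfamiliar with Mermaid, the sketch below shows how citation edges might be turned into a Mermaid block inside a Markdown file. The exact layout rag-paper emits is not documented here, so the node naming, the graph direction, and the function name are assumptions:

```python
def to_mermaid(edges: list[tuple[str, str]], titles: dict[str, str]) -> str:
    """Render (citing DOI, cited DOI) edges as a Mermaid flowchart in Markdown."""
    fence = "`" * 3  # Markdown code fence, built this way to keep the example readable
    # Mermaid node IDs must be simple tokens, so DOIs are mapped to p0, p1, ...
    node_ids = {doi: f"p{i}" for i, doi in enumerate(sorted(titles))}
    lines = [fence + "mermaid", "graph LR"]
    for doi, node_id in node_ids.items():
        label = titles[doi].replace('"', "'")  # double quotes would end the label early
        lines.append(f'    {node_id}["{label}"]')
    for citing, cited in edges:
        # Only draw edges between papers that are in the local library.
        if citing in node_ids and cited in node_ids:
            lines.append(f"    {node_ids[citing]} --> {node_ids[cited]}")
    lines.append(fence)
    return "\n".join(lines)
```

Writing the returned string into a .md file gives a note whose graph is rendered by Obsidian's built-in Mermaid support.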

Metadata Enrichment

rag-paper can enrich indexed papers with DOI and bibliographic metadata.

Supported providers:

CrossRef
OpenAlex

Default order:

{
  "metadata_enrichment": {
    "providers": ["crossref", "openalex"]
  }
}

If the first provider fails or returns no match, rag-paper falls back to the next provider.
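
The fallback rule can be sketched as a loop over provider callables. The providers below are stand-ins, not real CrossRef or OpenAlex clients, and the function and type names are hypothetical:

```python
from typing import Callable, Optional

# A provider takes a title query and returns metadata, or None when nothing matches.
Provider = Callable[[str], Optional[dict]]


def enrich_with_fallback(title: str, providers: list[tuple[str, Provider]]) -> Optional[dict]:
    """Try each provider in order and return the first non-empty match."""
    for name, lookup in providers:
        try:
            result = lookup(title)
        except Exception as exc:  # network errors, bad responses, rate limits
            print(f"{name} failed: {exc}")
            continue
        if result:
            return {**result, "provider": name}
    return None  # every provider failed or had no match
```

Recording which provider answered, as the provider key does here, makes it easier to audit enriched metadata later.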

Run enrichment manually:

rag-paper enrich-metadata

Refresh existing metadata:

rag-paper enrich-metadata --force

Refresh one indexed PDF:

rag-paper enrich-metadata --file /path/to/paper.pdf --force

--file first checks whether the PDF has already been indexed in local Chroma. If not, rag-paper exits and asks you to index it first.

Metadata enrichment uses a SQLite cache by default:

rag_paper_data/cache/metadata_enrichment.sqlite3

This avoids repeatedly calling CrossRef/OpenAlex for the same DOI or title query.
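
A response cache of this kind can be very small. The sketch below uses Python's built-in sqlite3 module; the table layout, class name, and key format are assumptions for illustration and do not describe the real metadata_enrichment.sqlite3 schema:

```python
import json
import sqlite3
from typing import Callable, Optional


class EnrichmentCache:
    """Tiny key/value cache; the table layout is an illustrative assumption."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def get(self, key: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT payload FROM responses WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, payload: dict) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO responses (key, payload) VALUES (?, ?)",
                (key, json.dumps(payload)),
            )


def cached_lookup(cache: EnrichmentCache, key: str, fetch: Callable[[str], dict]) -> dict:
    """Return the cached payload, calling fetch only on a cache miss."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = fetch(key)
    cache.put(key, result)
    return result
```

With a key such as "doi:10.1000/abc", a second lookup for the same paper is served from SQLite and never reaches the provider.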

MCP Usage

Start the MCP server:

rag-paper serve

Streamable HTTP configuration example:

{
  "mcpServers": {
    "rag-paper": {
      "url": "http://127.0.0.1:8765/mcp"
    }
  }
}

stdio configuration example:

{
  "mcpServers": {
    "rag-paper": {
      "command": "rag-paper",
      "args": ["serve"]
    }
  }
}

Available MCP tools include:

service_info
import_papers
list_indexed_papers
show_indexed_paper
delete_indexed_paper
search_papers
search_by_metadata
get_chunk
export_context
enrich_paper_metadata
dedupe_papers
build_paper_citation_graph

CLI Commands

rag-paper init-config
rag-paper index
rag-paper index --force
rag-paper index --file /path/to/paper.pdf
rag-paper index --only-new
rag-paper index --retry-failed
rag-paper enrich-metadata
rag-paper list-indexed-papers
rag-paper show-indexed-paper "selector"
rag-paper delete-indexed-paper "selector"
rag-paper search "query"
rag-paper dedupe-papers
rag-paper build-citation-graph
rag-paper serve

Notes on Indexing

rag-paper stores indexing state in:

rag_paper_data/chroma_db/index_manifest.json

It uses a two-stage change check:

Compare size + mtime_ns.
Compute SHA256 only when the quick check changed.
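
The two-stage check can be sketched as follows. The manifest entry shape (size, mtime_ns, sha256) is an assumption based on the description above, and the function names are hypothetical:

```python
import hashlib
import os
from pathlib import Path
from typing import Optional


def quick_signature(path: Path) -> dict:
    """Cheap signature from file metadata only; the file is not read."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Stream the file in blocks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, entry: Optional[dict]) -> bool:
    """Return True when the file content differs from the manifest entry."""
    if entry is None:
        return True  # never indexed
    sig = quick_signature(path)
    if sig["size"] == entry["size"] and sig["mtime_ns"] == entry["mtime_ns"]:
        return False  # stage 1: quick signature unchanged, skip hashing
    # Stage 2: the signature changed, so confirm with a content hash.
    return sha256_of(path) != entry["sha256"]
```

The benefit of the second stage is that a file whose timestamp changed but whose bytes did not (for example after a sync or copy) is not re-embedded.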

If indexing fails for a PDF, rag-paper records the failure in:

rag_paper_data/logs/index_failed.jsonl

Retry failed files:

rag-paper index --retry-failed
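
A JSONL failure log holds one JSON object per line, which makes it easy to append to and to scan. The sketch below shows the general pattern; the field names (path, error, failed_at) are assumptions and may not match what rag-paper writes to index_failed.jsonl:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_failure(log_path: Path, pdf_path: str, error: Exception) -> None:
    """Append one JSON object per line; the field names are illustrative."""
    entry = {
        "path": pdf_path,
        "error": f"{type(error).__name__}: {error}",
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")


def failed_paths(log_path: Path) -> list[str]:
    """Return unique failed paths in first-seen order, ignoring blank lines."""
    if not log_path.exists():
        return []
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                seen.setdefault(json.loads(line)["path"], None)
    return list(seen)
```

A retry pass would then iterate over failed_paths(...) instead of rescanning every root.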

If you press Ctrl+C, completed files are already persisted in Chroma and the manifest, so the next run continues from the remaining files.

Development

pip install -e ".[dev]"
python -m pytest -q

License

This project is licensed under the MIT License. See LICENSE.

Configuration Reference

Common options:

data_dir: core runtime data directory. Default: ./rag_paper_data
papers[].root_path: array of PDF root directories. Useful with Zotero folders.
papers[].skip_marker_file: marker filename used to skip private directories.
papers[].tags: default tags for papers under the root paths.
chroma.persist_dir: Chroma persistence directory.
indexing.metadata_path: paper metadata JSON path.
indexing.assume_yes: skip interactive confirmation.
indexing.max_files: maximum PDFs to index.
indexing.failed_path: index failure JSONL path.
metadata_enrichment.enabled: enable DOI and metadata enrichment.
metadata_enrichment.providers: provider order, e.g. ["crossref", "openalex"].
metadata_enrichment.timing: per_file, after_index, or manual.
metadata_enrichment.user_agent: User-Agent for CrossRef/OpenAlex requests.
metadata_enrichment.mailto: contact email for provider etiquette.
metadata_enrichment.openalex_email: OpenAlex email parameter.
metadata_enrichment.requests_per_second: provider request rate limit.
metadata_enrichment.http_proxy: HTTP proxy.
metadata_enrichment.https_proxy: HTTPS proxy.
metadata_enrichment.socks5_proxy: SOCKS5 proxy.
metadata_enrichment.cache_path: SQLite enrichment cache path.
dedup.enabled: enable duplicate report before indexing.
dedup.action: report or skip.
dedup.similarity_threshold: semantic duplicate threshold.
citation_graph.path: citation graph JSON output.
citation_graph.mermaid_path: Mermaid Markdown output for Obsidian.
display.datetime_timezone: display timezone for metadata_enriched_at.
display.datetime_format: strftime format for displayed datetimes.
logging.level: log level.
logging.stats_path: retrieval stats JSONL path.
