paperstack
Enables arXiv paper search, PDF download, text extraction, and context chunking for LLM pipelines, along with advanced features like citation graphs and reproducibility scoring.
paperstack (Model Context Protocol)
Overview
paperstack is a production-grade Model Context Protocol (MCP) server focused on arXiv research retrieval.
It provides:
- arXiv Atom API search by ID/query
- PDF download, validation, and cache
- PDF text extraction (title, abstract, body, references)
- Token-aware context chunking for LLM pipelines
- CLI, API, and autonomous agent integration support
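To make the first bullet concrete, here is a minimal sketch of what a search against the arXiv Atom API involves, using only the standard library. It is not paperstack's `ArxivClient`: the helper names `build_query_url` and `parse_entries` are hypothetical, and the sketch only builds the request URL and parses a feed that has already been downloaded.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"   # public arXiv API endpoint
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}  # Atom XML namespace


def build_query_url(query=None, arxiv_id=None, max_results=3):
    """Build an arXiv API URL for a free-text query or for a single ID."""
    if (query is None) == (arxiv_id is None):
        raise ValueError("pass exactly one of query or arxiv_id")
    params = {"start": 0, "max_results": max_results}
    if arxiv_id is not None:
        params["id_list"] = arxiv_id            # lookup by arXiv ID
    else:
        params["search_query"] = f"all:{query}"  # search all fields
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entries(feed_xml):
    """Extract id and title from each <entry> of an Atom feed string."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append({
            "id": raw_id.strip(),
            # Titles in the feed may wrap across lines; collapse whitespace.
            "title": " ".join(raw_title.split()),
        })
    return entries


if __name__ == "__main__":
    print(build_query_url(query="quantum computing", max_results=3))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_entries` would complete the round trip; that step is left out so the sketch runs without network access.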
Table of Contents
- Quickstart
- Installation
- Usage
- MCP Server
- Project structure
- Configuration
- Testing
- Troubleshooting
- Contributing
- License
Quickstart
1. Clone repository
git clone https://github.com/Aldrin-Joan/paperstack.git
cd paperstack
2. Set up Python environment (recommended)
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
3. Install dependencies
pip install -r requirements.txt
4. Run smoke test
python test_smoke.py
Installation
From source:
pip install -e .
From PyPI:
pip install paperstack-mcp
Usage
CLI
paperstack --help
Run server locally:
python -m src.mcp_server
Python API
from paperstack_mcp import entrypoint # import alias for the package
from src.arxiv_client import ArxivClient
from src.pdf_fetcher import PdfFetcher
from src.pdf_parser import PdfParser
from src.context_builder import ContextBuilder
client = ArxivClient()
results = client.search('quantum computing', max_results=3)
pdf_path = PdfFetcher().fetch_paper(results[0].id)
parsed = PdfParser().parse(pdf_path)
context = ContextBuilder().build(parsed)
print(context.summary)
Architecture Layers
| Layer | Features |
|---|---|
| Layer 1 — retrieval (both tools have this) | Search · PDF fetch + cache · Text extraction + chunking |
| Layer 2 — intelligence (your opportunity) | Citation graph · Concept extraction · Cross-paper synthesis |
| Layer 3 — dev tooling (highly unique) | Code + dataset links · Implementation diff · Reproducibility audit |
| Layer 4 — research workflows (unique) | Reading lists · Topic tracking + alerts · Agent-ready Q&A |
MCP Server
src/mcp_server/__main__.py starts an MCP tool server exposing:
- arxiv_search (query or ID expand)
- arxiv_fetch_pdf (download + cache)
- arxiv_parse_pdf (extract text and metadata)
- arxiv_build_context (chunk to LLM-friendly context)
- arxiv_citation_graph (author/paper citation network)
- arxiv_extract_contributions (structured contribution extractor)
- arxiv_semantic_index (semantic similarity index builder/query)
- arxiv_compare_papers (paper comparison report)
- arxiv_extract_code_links (discover official GitHub/HuggingFace/Kaggle links from a paper)
- arxiv_reproducibility_score (reproducibility heuristic score with evidence details)
- arxiv_diff_implementations (compare paper method claims against a GitHub implementation)
- arxiv_reading_list (persistent reading list CRUD and filters)
- arxiv_watch_topic (watch query topics and detect new papers)
- arxiv_explain_for_audience (audience-specific explanation synthesis)
Use any MCP-capable client (VS Code MCP extension, custom agent SDK) to connect.
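MCP clients invoke a server's tools with JSON-RPC 2.0 `tools/call` requests. The sketch below shows roughly what a client would send to call `arxiv_search`. The argument names `query` and `max_results` are assumptions borrowed from the Python API example; the tool's actual input schema is what the server reports via `tools/list`, and `tool_call_request` is a hypothetical helper.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(tool_name, arguments):
    """Build a JSON-RPC 2.0 message that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumed argument names; check the schema the server actually advertises.
request = tool_call_request(
    "arxiv_search", {"query": "quantum computing", "max_results": 3}
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library or the VS Code extension builds and transports these messages for you; the sketch is only meant to show what travels over the wire.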
VS Code MCP server setup
In VS Code, add an MCP server entry to your workspace settings (e.g., .vscode/settings.json):
{
  "servers": {
    "arxiv-mcp": {
      "command": "D:/Softwares/Anaconda3/python.exe",
      "args": ["-m", "src.mcp_server"],
      "cwd": "${workspaceFolder}",
      "env": {
        "PYTHONPATH": "${workspaceFolder}",
        "ARXIV_DOWNLOAD_DIR": "${workspaceFolder}/downloads",
        "ARXIV_KEEP_PDFS": "true",
        "CHUNK_SIZE_TOKENS": "800",
        "CHUNK_OVERLAP_TOKENS": "100",
        "ARXIV_RATE_LIMIT_DELAY": "3.0",
        "MAX_RETRIES": "3",
        "HTTP_TIMEOUT": "60"
      }
    }
  }
}
MCP JSON entry for paperstack-mcp
If you installed from PyPI (pip install paperstack-mcp), the MCP server command can be the package executable instead of a direct Python module path. In .vscode/mcp.json or your .code-workspace settings, use an entry like:
{
  "servers": {
    "paperstack-mcp": {
      "command": "paperstack-mcp",
      "args": [],
      "cwd": "C:\\path\\to\\your\\project",
      "env": {
        "PYTHONPATH": "C:\\path\\to\\your\\project",
        "ARXIV_DOWNLOAD_DIR": "C:\\path\\to\\your\\project\\downloads",
        "ARXIV_KEEP_PDFS": "false",
        "CHUNK_SIZE_TOKENS": "800",
        "CHUNK_OVERLAP_TOKENS": "100",
        "ARXIV_RATE_LIMIT_DELAY": "3.0",
        "MAX_RETRIES": "3",
        "HTTP_TIMEOUT": "60"
      }
    }
  }
}
Adjust values for your local path, rate limit, and retry/timeouts.
- Run pip install paperstack-mcp first.
- Ensure workspace cwd and PYTHONPATH point to the project root.
- Customize ARXIV_DOWNLOAD_DIR for your downloaded PDF cache location.
- ARXIV_DOWNLOAD_DIR: local storage for downloaded PDFs.
- ARXIV_KEEP_PDFS: keep cached PDFs after parse.
- CHUNK_SIZE_TOKENS / CHUNK_OVERLAP_TOKENS: control text chunking in the context builder.
- ARXIV_RATE_LIMIT_DELAY: delay between arXiv API calls.
- MAX_RETRIES, HTTP_TIMEOUT: network robustness.
You can also apply this configuration in other compatible MCP clients using their server configuration schema.
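To illustrate how CHUNK_SIZE_TOKENS and CHUNK_OVERLAP_TOKENS interact, here is a minimal sliding-window sketch. It is not the project's context builder: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for whatever tokenizer paperstack actually uses.

```python
def chunk_tokens(tokens, chunk_size=800, overlap=100):
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # Whitespace split is only a stand-in for a real tokenizer.
    words = ("lorem ipsum " * 1000).split()
    for i, chunk in enumerate(chunk_tokens(words)):
        print(i, len(chunk))
```

With the values from the configuration above (800 and 100), each window advances by 700 tokens, so the last 100 tokens of one chunk reappear at the start of the next. A larger overlap preserves more context across chunk boundaries at the cost of more total tokens sent to the model.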
Project structure
- src/ - package source
  - arxiv_client/ - arXiv Atom API logic
  - pdf_fetcher/ - download/cache PDF
  - pdf_parser/ - extract/clean PDF text
  - context_builder/ - tokenization + chunking
  - mcp_server/ - MCP protocol/adapters
- tests/ - pytest suite
- requirements.txt - dependencies
- pyproject.toml - package metadata
Configuration
Environment variables:
- ARXIV_CACHE_DIR (default: ./downloads)
- ARXIV_CACHE_TTL (default: 604800 seconds / 7 days)
- ARXIV_DB_PATH (default: ${ARXIV_DOWNLOAD_DIR}/arxiv_mcp.db): path to the SQLite workflow database
- ARXIV_RATE_LIMIT (default: 1 request/sec)
- S2_API_KEY (optional; Semantic Scholar API key for higher rate limits)
- OLLAMA_BASE_URL (default: http://localhost:11434)
- OLLAMA_MODEL (default: mistral)
- SEMANTIC_INDEX_DIR (default: ${ARXIV_DOWNLOAD_DIR}/semantic_index)
- CITATION_CACHE_TTL (default: 86400 seconds / 24 hours)
- CONTRIBUTION_CACHE_TTL (default: 604800 seconds / 7 days)
- EMBEDDING_MODEL (default: sentence-transformers/all-MiniLM-L6-v2)
- GITHUB_TOKEN (optional; for GitHub API auth, raises the limit from 60 to 5000 requests/hour)
- LINK_CACHE_TTL (default: 172800 seconds / 48 hours)
- REPRO_CACHE_TTL (default: 604800 seconds / 7 days)
- DIFF_CACHE_TTL (default: 86400 seconds / 24 hours)
- GITHUB_MAX_FILES (default: 20)
- GITHUB_MAX_FILE_SIZE_KB (default: 50)
Set these in your shell or in a .env file before running.
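As a rough illustration of how such variables and their defaults can be resolved, the sketch below reads a handful of them from the environment. The `Settings` class and `load_settings` function are hypothetical and not paperstack's actual loader; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    cache_dir: str
    cache_ttl_seconds: int
    ollama_base_url: str
    ollama_model: str
    github_max_files: int
    github_token: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Resolve a subset of the documented variables, falling back to defaults."""
    env = os.environ if env is None else env
    return Settings(
        cache_dir=env.get("ARXIV_CACHE_DIR", "./downloads"),
        cache_ttl_seconds=int(env.get("ARXIV_CACHE_TTL", "604800")),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "mistral"),
        github_max_files=int(env.get("GITHUB_MAX_FILES", "20")),
        # Optional: treat an unset or empty token as "no authentication".
        github_token=env.get("GITHUB_TOKEN") or None,
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of `os.environ` makes it easy to check overrides in tests, for example `load_settings({"ARXIV_CACHE_TTL": "60"})`.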
Testing
Run full tests:
pytest -q
Smoke test:
python test_smoke.py
Troubleshooting
- arxiv-mcp command not found: ensure the virtualenv is active and the package is installed.
- PDF download failure: check network access to https://arxiv.org/pdf/
- Rate-limit errors: lower the request frequency or adjust ARXIV_RATE_LIMIT.
- Topic duplicates observed after repeated tests: use DatabaseClient.reset() on the workflow DB; topic_watcher.add now enforces dedupe by (query, label).
- Reading list duplicate notes: ReadingListManager.add now avoids re-appending identical note blocks.
- Ollama not available fallback: _passthrough now uses arXiv metadata.abstract for all explanation fields (what_it_is / problem_solved / how_it_works / why_it_matters / key_result).
- Dependency pin check: pip install -r requirements.txt includes protobuf==3.20.3 and urllib3>=2.0.0,<3 to avoid known warning/conflict cases (TensorFlow + ChromaDB MessageFactory and Requests RequestsDependencyWarning).
- Smoke harness summary: scripts/run_all_tools.py prints the final status with a count of run/passed/failed tools.
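For the rate-limit and download failures listed above, one common client-side mitigation is to retry with a growing delay. The sketch below is a generic pattern, not paperstack's internal retry logic: `call_with_retries` is a hypothetical helper, and its defaults merely mirror the MAX_RETRIES and ARXIV_RATE_LIMIT_DELAY values from the sample configuration.

```python
import time


def call_with_retries(func, max_retries=3, delay_seconds=3.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call func(), retrying on the given exceptions with a linearly growing delay."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            # Wait longer after each failure: delay, 2 * delay, 3 * delay, ...
            sleep(delay_seconds * attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_download():
        # Stand-in for a network call that fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(call_with_retries(flaky_download, delay_seconds=0.1))
```

The `sleep` parameter is injectable so the helper can be tested without real waiting.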
Contributing
- Fork repo
- Create feature branch
- Add tests and update README
- Open PR
Follow style checks (Black, formatting and lint).
License
Apache-2.0