docshelf-mcp
docshelf-mcp manages AI-friendly document collections: it converts PDFs and Markdown into chapter-split shelves with a single navigation INDEX, so AI agents can fetch only the relevant section by raw URL instead of choking on a 4 MB datasheet. Repo: https://github.com/ignatenkofi/docshelf-mcp
README
docshelf-mcp
Put your manuals on a shelf, hand the AI the index.
Docs & landing page: https://ignatenkofi.github.io/docshelf-mcp/
___ __ ____ ____ _ _ ____ __ ____
/ __)/ \(_ _)/ ___)/ )( \( __)( ) ( __)
( (_ \( O ) )( \___ \) __ ( ) _) / (_/\ ) _)
\___/ \__/ (__) (____/\_)(_/(____)\____/(__)
MCP server for AI-friendly doc shelves
An MCP server that turns a folder of PDFs and Markdown into a chat-project-friendly document collection: AI agents see a single INDEX.md and pull individual sections by raw GitHub URL on demand, instead of choking on a 4 MB datasheet.
Why?
You have 30 hardware manuals, or 200 cooking recipes, or a stack of research PDFs.
You want Claude / ChatGPT / whatever to be able to answer questions across them, but:
- You can't dump 80 MB of PDFs into a chat project. It won't fit, and you'd burn the context window even if it did.
- You can manually copy-paste the relevant pages, but only after you remember which manual mentioned the thing you need.
- Long files mean retrieval is wasteful: the model loads the whole RouterOS guide just to answer a question about VLANs.
docshelf-mcp solves it like this:
- You drop a PDF onto the shelf.
- The shelf converts it to Markdown, splits big files chapter-by-chapter, and regenerates a navigation INDEX.md.
- You commit and push to a public GitHub repo.
- You add only INDEX.md to your Claude project. When the model needs a section, it fetches it via raw.githubusercontent.com.
Result: a 5 KB index pointing at a 50 MB collection. The model reads exactly the chapter it needs.
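To make the raw-URL step concrete, here is a minimal sketch of how a repo-relative path maps to a raw.githubusercontent.com URL. The helper name raw_url and the default branch "main" are assumptions for illustration, not part of the docshelf-mcp API; the URL shape itself is the standard GitHub raw-content layout.

```python
from urllib.parse import urlparse


def raw_url(remote: str, path: str, branch: str = "main") -> str:
    """Map a github.com repo URL plus a repo-relative path to its raw URL.

    Hypothetical helper: the name and the default branch are assumptions.
    """
    parsed = urlparse(remote)
    if parsed.netloc != "github.com":
        raise ValueError(f"not a github.com remote: {remote}")
    # "/owner/repo" or "/owner/repo.git" -> ["owner", "repo"]
    parts = parsed.path.strip("/").removesuffix(".git").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected https://github.com/<owner>/<repo>: {remote}")
    owner, repo = parts
    return (
        f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/"
        f"{path.lstrip('/')}"
    )


print(raw_url(
    "https://github.com/me/my-homelab-docs",
    "docs/routers/mikrotik-routeros/002-bridging.md",
))
```

An INDEX.md entry only needs to carry a URL of this shape for the model to fetch one section without touching the rest of the collection.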
Install
From PyPI (once the first tagged release is published):
# uv (recommended)
uv pip install docshelf-mcp
# or plain pip
pip install docshelf-mcp
Or straight from main (always-latest, no PyPI required):
pip install "git+https://github.com/ignatenkofi/docshelf-mcp"
Optional high-quality PDF engine (pulls ~2 GB of PyTorch, so install it only if you need it):
pip install "docshelf-mcp[high-quality]"
Project Prompt
Drop this into the Custom Instructions of any Claude project that consumes
a docshelf-style INDEX.md:
This project uses the docshelf pattern. INDEX.md is the entry point. When answering: read INDEX → fetch ONLY the needed section file via its GitHub raw URL (use WebFetch / fetch / curl). Don't load full source files into context. For large manuals split into chapters, follow INDEX → chapter SUBINDEX → section file.
Medium (~150 words) and full (~400 words) versions, plus how-to snippets for
Claude Code, Claude Desktop, and the Anthropic API, live in
docs/PROJECT_PROMPT.md.
Quickstart (Python library)
from docshelf_mcp import Shelf

shelf = Shelf("~/Documents/my-homelab-docs").init(
    name="My HomeLab Docs",
    remote="https://github.com/me/my-homelab-docs",
    default_categories=["routers", "switches", "psu", "motherboards"],
)

shelf.add_document(
    "~/Downloads/MIKROTIK_RouterOS.pdf",
    category="routers",
    title="Mikrotik RouterOS - full manual",
    description="Official RouterOS reference, split by chapter.",
)
# → docs/routers/mikrotik-routeros-full-manual.md + docs/routers/.../001-..md, 002-..md, ...
# → INDEX.md is regenerated automatically.
Then in the shelf directory: git add . && git commit -m "docs: add RouterOS" && git push.
In your Claude project, attach only INDEX.md. Done.
Quickstart (MCP server)
1. Add to Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%/Claude/claude_desktop_config.json (Windows):
{
"mcpServers": {
"docshelf": {
"command": "docshelf-mcp",
"env": {
"DOCSHELF_ROOT": "/Users/me/Documents/my-homelab-docs"
}
}
}
}
Restart Claude Desktop. You now have six new tools available:
| Tool | What it does |
|---|---|
| docshelf_init_shelf | Bootstrap a new shelf directory. |
| docshelf_add_document | Add a PDF/MD file. Converts, splits, re-indexes. |
| docshelf_rebuild_index | Regenerate INDEX.md from disk. |
| docshelf_search | Plain-text search across the shelf, with raw URLs. |
| docshelf_list_documents | List documents by category. |
| docshelf_convert_pdf | Standalone PDF → Markdown (no shelf). |
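For a feel of what a plain-text search over a shelf involves, here is a rough sketch. It is not the docshelf_search implementation: the function name search_shelf and the (path, line number, line) result shape are assumptions, and it returns repo-relative paths rather than raw URLs.

```python
from pathlib import Path


def search_shelf(root, query: str):
    """Case-insensitive substring search over every .md file under root.

    Hypothetical helper: returns (relative path, line number, line) tuples.
    """
    root = Path(root)
    needle = query.casefold()
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                rel = path.relative_to(root).as_posix()
                hits.append((rel, line_no, line.strip()))
    return hits
```

Because each hit carries a repo-relative path, turning it into a fetchable link is just a matter of prefixing the shelf's raw base URL.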
2. Add to Claude Code
claude mcp add docshelf -- docshelf-mcp
# Optional: set the default shelf
claude mcp add docshelf --env DOCSHELF_ROOT=/path/to/shelf -- docshelf-mcp
3. Test from the command line
# Sanity check: should print the server version, then wait on stdin
docshelf-mcp
The shelf layout
my-shelf/
├── .docshelf.json              ← shelf metadata: name, remote, category order
├── INDEX.md                    ← auto-generated navigation (your chat-project file)
├── .gitignore
└── docs/
    ├── routers/
    │   ├── .meta.json          ← per-document title/description overrides
    │   ├── mikrotik-routeros.md    (full document, lightly cleaned)
    │   └── mikrotik-routeros/      (auto-split sections)
    │       ├── 001-overview.md
    │       ├── 002-bridging.md
    │       └── 003-firewall.md
    └── switches/
        └── cudy-gs1010pe.md
Everything in docs/ is committed; everything is fetchable via raw URL once you push to GitHub.
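The layout above is regular enough that an index can be rebuilt from disk alone. The sketch below shows the idea under loud assumptions: the function name build_index, the "# INDEX" heading, and the one-bullet-per-document format are all invented for illustration, and the real INDEX.md generated by docshelf-mcp may look different (for instance, it can use titles and descriptions from .meta.json, which this sketch ignores).

```python
from pathlib import Path


def build_index(root, raw_base: str = "") -> str:
    """Render a minimal Markdown index of docs/<category>/*.md.

    Hypothetical format: one H2 per category, one bullet per document.
    Only top-level documents are listed, not their split sections.
    """
    root = Path(root)
    docs = root / "docs"
    lines = ["# INDEX", ""]
    if not docs.is_dir():
        return "\n".join(lines)
    for category in sorted(p for p in docs.iterdir() if p.is_dir()):
        lines.append(f"## {category.name}")
        for doc in sorted(category.glob("*.md")):
            rel = doc.relative_to(root).as_posix()
            # Without a base URL, fall back to a repo-relative link.
            target = f"{raw_base.rstrip('/')}/{rel}" if raw_base else rel
            lines.append(f"- [{doc.stem}]({target})")
        lines.append("")
    return "\n".join(lines)
```

Passing a raw base such as https://raw.githubusercontent.com/me/my-homelab-docs/main yields links an agent can fetch directly; leaving it empty gives a local-only index.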
How splitting works
A document is split when both conditions hold:
- UTF-8 size > 50 KB (configurable via .docshelf.json:split_threshold_bytes).
- The document has at least two ## (H2) headings.
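The two conditions above can be sketched as a small predicate. This is an illustration, not the library's code: the name should_split is hypothetical, and reading "50 KB" as 50 * 1024 bytes is an assumption (the actual default of split_threshold_bytes is not stated here).

```python
import re

# Assumption: "50 KB" is interpreted as 50 * 1024 bytes.
SPLIT_THRESHOLD_BYTES = 50 * 1024

# An H2 line starts with exactly two '#' followed by a space and some text.
H2_RE = re.compile(r"^## +\S", re.MULTILINE)


def should_split(markdown: str, threshold: int = SPLIT_THRESHOLD_BYTES) -> bool:
    """Return True when the text is over the size threshold AND has >= 2 H2s."""
    size = len(markdown.encode("utf-8"))  # size in UTF-8 bytes, not characters
    h2_count = len(H2_RE.findall(markdown))
    return size > threshold and h2_count >= 2
```

Note that this naive regex also counts "## " lines inside fenced code blocks; a more careful check would skip those.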
The splitter:
- Cleans PDF-extraction noise (collapses runaway blank lines, demotes CLI dumps mistaken for H1s).
- Slices on H2 boundaries.
- Names files NNN-<slug>.md so they sort naturally and survive title changes.
- Wipes the previous split directory before regenerating, which keeps the operation fully idempotent.
If you want to keep a document whole, pass split=False.
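As a rough picture of the slicing and naming steps, the sketch below cuts a Markdown string at H2 headings and produces NNN-<slug>.md names. It is a simplified stand-in, not the shipped splitter: split_on_h2 and slug are hypothetical names, the slug here is ASCII-only, text before the first H2 is dropped, and no noise cleaning is attempted.

```python
import re

# Capture the heading text of each H2 line.
H2_RE = re.compile(r"^## +(.+?)[ \t]*$", re.MULTILINE)


def slug(title: str) -> str:
    """ASCII-only slug: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "section"


def split_on_h2(markdown: str) -> dict[str, str]:
    """Map 'NNN-<slug>.md' file names to the section text they would hold."""
    matches = list(H2_RE.finditer(markdown))
    sections = {}
    for number, match in enumerate(matches, start=1):
        # A section runs from its heading to the next H2 (or end of text).
        end = matches[number].start() if number < len(matches) else len(markdown)
        name = f"{number:03d}-{slug(match.group(1))}.md"
        sections[name] = markdown[match.start():end]
    return sections


doc = "intro\n## Overview\na\n## Bridging & VLANs\nb\n"
print(list(split_on_h2(doc)))  # ['001-overview.md', '002-bridging-vlans.md']
```

The zero-padded prefix is what makes the files sort in document order in any directory listing.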
Examples
See the examples/ directory for three concrete use cases:
- examples/homelab/: original use case, hardware manuals for a home lab.
- examples/recipes/: a cookbook with one recipe per file.
- examples/research-papers/: academic PDFs with abstracts in .meta.json.
Each example shows the directory layout and the INDEX.md you'd end up with.
Optional: high-quality PDF conversion
The default engine (pymupdf4llm) is fast and good enough for ~95% of technical documents. For papers with complex tables, math, or scanned content, install the marker-pdf backend:
pip install "docshelf-mcp[high-quality]"
Then pass quality="high":
shelf.add_document("paper.pdf", category="research", title="...", quality="high")
Warning: marker-pdf pulls in PyTorch (~2 GB) and is significantly slower (10-60 s per document on CPU). The library import is deferred: if you don't use quality="high", the dependency is never loaded.
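A deferred import of this kind usually means the heavy module is only imported inside the code path that needs it. The sketch below shows the general pattern with importlib; load_engine and the ENGINE_MODULES mapping are hypothetical names, and how docshelf-mcp actually wires its engines is not shown here. The module names pymupdf4llm and marker are the import names of the two packages mentioned above.

```python
import importlib

# Assumption: a quality level maps to the import name of its backend.
ENGINE_MODULES = {"fast": "pymupdf4llm", "high": "marker"}


def load_engine(quality: str = "fast", modules=ENGINE_MODULES):
    """Import the backend for `quality` only when it is actually requested."""
    try:
        module_name = modules[quality]
    except KeyError:
        raise ValueError(f"unknown quality: {quality!r}") from None
    try:
        # Nothing heavy is loaded until this line runs.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"engine {module_name!r} is not installed for quality={quality!r}"
        ) from exc
```

With this shape, a user who never asks for quality="high" never pays the import cost, and a user who asks without the extra installed gets a clear error instead of a bare ImportError.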
FAQ
Why GitHub raw URLs and not embeddings / RAG?
Because it's dead simple, costs nothing to host, and the AI is already good at chasing links. You can layer embedding search on top later if you want, since the on-disk shape is a normal git repo.
Does this work with private repos?
Not for the raw-URL trick: raw.githubusercontent.com won't serve them without auth. The local search tool works fine on private shelves; you just lose the "AI fetches sections directly" benefit. Make the doc repo public (separate from your code repo).
Do I have to use GitHub?
No. The shelf is just a directory. If you don't set a github_remote, INDEX.md still gets generated; entries just won't have URLs. You can host the static files anywhere that serves raw text (S3, Cloudflare R2, GitLab raw, Gitea, ...) and post-process URLs yourself.
Does it edit the source PDFs?
No. PDFs are converted on add_document and the source is left in place. The shelf only writes inside its own directory.
What about non-English documents?
Slugify is Unicode-aware (NFKD-normalized, with \w under re.UNICODE). Cyrillic / CJK titles slug down to ASCII-ish forms; the body Markdown is preserved as-is.
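For readers curious what "NFKD-normalized, with \w under re.UNICODE" can look like in practice, here is a minimal sketch. It is an assumption-laden illustration, not the library's slugify: it strips combining marks after NFKD so accented Latin letters fall back to their base letter, but it does not transliterate, so non-Latin letters are kept as-is. The real implementation may treat Cyrillic and CJK differently.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase slug: NFKD-decompose, drop combining marks, join with '-'."""
    decomposed = unicodedata.normalize("NFKD", title)
    # After NFKD, "è" is "e" + a combining grave accent; drop the accent.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Any run of non-word characters (spaces, punctuation) becomes one '-'.
    slug = re.sub(r"[^\w]+", "-", stripped.lower(), flags=re.UNICODE)
    return slug.strip("-_")


print(slugify("Crème Brûlée"))  # creme-brulee
```

Since \w matches Unicode letters, a Cyrillic or CJK title survives as a non-empty slug here instead of collapsing to nothing.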
Can I use it without MCP?
Yes: run from docshelf_mcp import Shelf and use the class directly. See docs/USAGE.md.
Limitations
- Public GitHub only for the raw-URL trick (or whatever public static host you wire up).
- Single repo per shelf. If you outgrow one repo, run multiple shelves and attach multiple INDEX.md files.
- Heuristic splitting. The PDF-to-Markdown extract isn't always clean enough to split cleanly. For pathological cases (some 4+ MB datasheets), keep the file whole and rely on docshelf_search.
- No automatic git commit. Tools regenerate INDEX.md on disk, but the caller (you, or an agent) is responsible for git add / commit / push. This is intentional: staying out of git's way keeps the tool safe to call from agents.
Demo
A short walkthrough video / GIF is planned: https://github.com/ignatenkofi/docshelf-mcp/blob/main/docs/demo.md (coming soon)
Architecture
For a deeper dive, see docs/ARCHITECTURE.md, which covers module layout, data flow, and design rationale.
Contributing
Bug reports and PRs welcome. To set up a dev env:
git clone https://github.com/ignatenkofi/docshelf-mcp
cd docshelf-mcp
uv pip install -e ".[dev]"
ruff check src tests
pytest -v
License
MIT. See LICENSE.
Origin
docshelf-mcp started life as a 350-line Python script (homelab-encyclopedia.py) that managed a single homelab manuals repo. The split / index / clean logic is the same code, generalised to work for any category-organised document collection.