scifinder-route-mcp
MCP server for indexing and searching reaction-step-level synthesis routes from local SciFinder exports, designed for Docker/NAS deployment.
NAS-hosted MCP server for indexing and searching reaction-step-level synthesis routes from local SciFinder exports. It is designed to run long-term on Docker/NAS with a read-only inbox, durable SQLite queue fallback, optional external OCR/LLM/vector/parser/structure-recognition APIs, and an operational Admin Web UI.
GHCR visibility note: if anonymous pull fails, open GitHub → Packages → scifinder-route-mcp → Package settings → Change visibility → Public. The compose file is already configured for ghcr.io/kettly1260/scifinder-route-mcp:latest.
Quick Start With Prebuilt Image
The published Docker image targets both linux/amd64 and linux/arm64.
git clone https://github.com/kettly1260/scifinder-route-mcp.git
cd scifinder-route-mcp
cp .env.example .env
mkdir -p nas-data nas-inbox
docker compose -f docker-compose.image.yml up -d
Then open:
Admin Web UI: http://<nas-host>:8001/
MCP SSE: http://<nas-host>:8000/sse
Put SciFinder exports into nas-inbox, then click Scan Inbox in the Admin Web UI or call the MCP scan_inbox tool. docker-compose.image.yml references only the prebuilt image (image:) and never builds locally.
Local Build Deployment
docker compose up -d --build
Persistent paths:
./nas-data -> /data
./nas-inbox -> /inbox (read-only in the container)
./nas-data/uploads -> /data/uploads (HTTP upload and sidecar staging)
Parsing is asynchronous in the NAS profile. Jobs are stored durably in SQLite; after a container restart, interrupted running jobs are re-queued. Poll get_parse_job_status or list_parse_jobs until completion.
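The polling step can be sketched as a small client-side helper. This is a minimal sketch, not the server's own code: `wait_for_job`, the injected `get_status` callable, the `"status"` key, and the terminal state names `completed` and `failed` are all assumptions for illustration. In practice `get_status` would wrap a call to the get_parse_job_status MCP tool.

```python
import time
from typing import Callable, Dict

# Assumed terminal state names; the real server may report different values.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    get_status: Callable[[str], Dict],
    job_id: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call get_status(job_id) repeatedly until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} is still {status.get('status')!r}")
        # Re-queued jobs after a container restart simply show up as
        # non-terminal again, so the loop keeps waiting.
        sleep(poll_seconds)
```

Because the status fetcher is injected, the same loop works whether the client talks to the server over SSE or through a test stub.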
Environment and Runtime Config
Copy .env.example to .env. Docker-level settings such as published ports, volumes, container network, and restart policy belong in .env/Compose only. The Admin Web UI never edits Docker files and never controls host Docker.
Hot application config is read from /data/config.yaml; copy config.example.yaml to ./nas-data/config.yaml if desired. Hot-reloadable sections include:
server.async_jobs, server.max_workers, server.storage_backend
queue.backend, queue.redis_url
security.allow_external_paths, security.token, security.users
ingest.scan_extensions
integrations.*
extraction.llm_schema_version, extraction.llm_prompt_profile, extraction.llm_cost_limit_usd
thresholds.verification_confidence_threshold
retention.evidence_retention_days, retention.cache_retention_days
Use MCP tools get_config, update_config, validate_config, and reload_config, or use the Admin Web UI.
Admin Web UI
The Admin Web UI provides operational controls for:
- health/status cards and mounted storage diagnostics
- token-protected config changes
- queue status, recent jobs, failed-job retry
- HTTP upload endpoint for sidecar/client upload
- LLM endpoint/model/enable toggle, schema version, prompt profile, cost limit
- embedding endpoint/model, vector rebuild, vector index status and errors
- OCR endpoint/model, OCR backlog status
- document parser endpoint/model, parser fallback and endpoint health
- structure recognition endpoint/model health
- PostgreSQL URL/backend status with SQLite fallback
- DOI low-confidence queue count
- evaluation latest metrics
- SQLite backup, retention dry-run cleanup, NAS storage usage
- compound registry count and search via MCP
MCP Tools
Implemented tools:
health_check
get_config
update_config
validate_config
reload_config
scan_inbox
register_document
upload_document
get_parse_job_status
list_parse_jobs
retry_parse_job
retry_failed_jobs
search_reaction_steps
get_reaction_step
get_reaction_provenance
record_doi_verification
reparse_document
export_evaluation_set
compute_evaluation_metrics
get_evaluation_status
rebuild_vector_index
get_vector_index_status
semantic_search_reaction_steps
search_compounds
get_compound
merge_compounds
search_by_smiles
recognize_structure_image
backup_database
get_storage_usage
cleanup_evidence_cache
test_integration_endpoint
Feature Matrix
| Area | Status | Notes |
|---|---|---|
| Docker/NAS SSE service | Implemented | Compose and prebuilt-image compose supported. |
| GHCR multi-arch image workflow | Implemented | linux/amd64, linux/arm64. GHCR package visibility may need manual public setting. |
| Read-only inbox scanning | Implemented | /inbox mounted read-only. |
| HTTP upload staging | Implemented | POST /api/upload writes to /data/uploads; hash dedupe supported. |
| Sidecar watcher | Implemented | scifinder-route-sidecar polling CLI uploads stable files. |
| Durable queue | Implemented | SQLite queue is default; restart recovery and retry tools. Redis is optional/degraded via config status, not required. |
| SQLite storage | Implemented | Source documents, jobs, reaction steps, provenance, DOI verification, vector rows, compounds, metrics. |
| PostgreSQL backend | Runnable degraded integration | SCIFINDER_ROUTE_BACKEND=postgres tests connectivity and reports status; SQLite remains active fallback unless a Postgres adapter is added for a deployment. |
| pgvector | Optional/degraded | SQLite stores embeddings as JSON and cosine-searches them; Postgres/pgvector reports endpoint/backend status. |
| PDF/HTML/MHTML/text parsing | Implemented | Built-in parser remains fallback. |
| External document parser | Implemented | /parse JSON adapter; failure falls back unless disabled. |
| OCR worker | Implemented adapter | /ocr JSON adapter for image-only PDFs/low-text docs; errors are job errors, not service crashes. |
| Rule extraction | Implemented | Candidate blocks and structured fields. |
| LLM JSON structuring | Implemented adapter | OpenAI-compatible /chat/completions; strict JSON; invalid responses fall back to rule fields with metadata error. |
| Embedding/vector index | Implemented adapter | OpenAI-compatible /embeddings; rebuild/status/semantic search tools. |
| Compound registry | Implemented | CAS/SMILES/InChIKey text extraction, alias registry, reaction roles; RDKit optional. |
| Image structure recognition | Implemented adapter | /recognize adapter creates low-confidence image candidates; does not overwrite text evidence. |
| Multi-user authorization | Implemented | viewer, operator, admin roles via SCIFINDER_ROUTE_USERS or config users. Legacy single token maps to admin. |
| Evaluation metrics | Implemented | JSONL gold-set metrics and latest metric status. |
| Backup/retention | Implemented | SQLite backup, storage usage, evidence/cache cleanup dry-run. |
| Endpoint health checks | Implemented | LLM, embedding, OCR, parser, structure recognition, Postgres. |
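The hash dedupe mentioned in the HTTP upload staging row could work roughly as follows. This is a sketch under stated assumptions: the README does not say which hash is used, so SHA-256 is an assumption, and `sha256_of_file`, `stage_upload`, and the in-memory `seen` map are hypothetical names (a real deployment would presumably keep the hashes in SQLite).

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_upload(path: Path, seen: Dict[str, Path]) -> Tuple[str, bool]:
    """Return (content_hash, is_new); is_new is False for duplicate content."""
    content_hash = sha256_of_file(path)
    if content_hash in seen:
        return content_hash, False
    seen[content_hash] = path
    return content_hash, True
```

Keying on content rather than filename means the same export uploaded under two names is staged only once.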
External API Schemas
All external services are optional. If a service is not configured or fails, the server returns a degraded/skipped/error status instead of crashing the process.
Embedding endpoint: POST <endpoint>/embeddings
{"model":"bge-m3","input":["text"]}
The expected response can follow the OpenAI format:
{"data":[{"embedding":[0.1,0.2]}]}
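To illustrate how a response of this shape feeds the SQLite fallback described in the feature matrix (embeddings stored as JSON and cosine-searched), here is a minimal sketch. The function names and the `stored` mapping from a step id to a JSON-encoded vector are assumptions, not the project's actual schema.

```python
import json
import math
from typing import Dict, List, Sequence, Tuple


def vectors_from_response(body: str) -> List[List[float]]:
    """Extract embedding vectors from an OpenAI-like /embeddings response body."""
    payload = json.loads(body)
    return [item["embedding"] for item in payload["data"]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: Sequence[float], stored: Dict[str, str], top_k: int = 5) -> List[Tuple[str, float]]:
    """Score every JSON-encoded stored vector against the query, best first."""
    scored = [(key, cosine(query, json.loads(raw))) for key, raw in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A linear scan like this is simple and dependency-free, which fits a NAS deployment, though it scales with the number of stored vectors.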
LLM endpoint: POST <endpoint>/chat/completions, OpenAI-compatible. The assistant content must be strict JSON with reaction-step fields.
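The strict-JSON requirement and the rule-field fallback noted in the feature matrix can be sketched as below. `structure_step`, the `metadata` keys, and the `reagents` field in the usage note are hypothetical; the real reaction-step schema is not specified here.

```python
import json
from typing import Any, Dict


def structure_step(assistant_content: str, rule_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Use the LLM output when it is a JSON object, else fall back to rule fields."""
    try:
        parsed = json.loads(assistant_content)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        fallback = dict(rule_fields)
        # Record why the LLM result was rejected instead of failing the job.
        fallback["metadata"] = {"source": "rules", "llm_error": str(exc)}
        return fallback
    parsed["metadata"] = {"source": "llm"}
    return parsed
```

For example, `structure_step("Sure! Here is the JSON:", {"reagents": ["NaBH4"]})` keeps the rule-extracted reagents and attaches the parse error as metadata.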
OCR endpoint: POST <endpoint>/ocr
{"model":"mineru-layout","file_path":"/data/uploads/file.pdf"}
Expected response:
{"text":"OCR text", "confidence":0.85}
Document parser endpoint: POST <endpoint>/parse
{"model":"parser-name","file_path":"/data/uploads/file.pdf"}
Expected response:
{"file_type":"pdf","title":"...","doi":"10....","chunks":[{"text":"...","page_number":1,"parser_name":"external","parser_version":"1"}]}
Structure recognition endpoint: POST <endpoint>/recognize
{"model":"decimer","image_path":"/data/evidence/page1.png"}
Expected response:
{"structures":[{"smiles":"CCO","confidence":0.7}]}
Sidecar Watcher
Create sidecar.yaml on a client machine:
watch_dir: /path/to/scifinder/exports
server_url: http://nas-host:8001
token: change-me
include_patterns:
- "*.pdf"
- "*.html"
settle_seconds: 3
upload_mode: http
poll_seconds: 2
Run:
scifinder-route-sidecar sidecar.yaml
The sidecar polls by default and does not require watchdog, making it suitable for Windows/macOS/Linux clients.
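One plausible reading of settle_seconds is that a file is uploaded only after its size and modification time have stopped changing for that long. The sketch below shows that idea with plain polling; `poll_once` and its `state` dictionary are assumptions for illustration, not the sidecar's actual implementation.

```python
import fnmatch
import time
from pathlib import Path
from typing import Dict, List, Optional, Sequence, Tuple

Snapshot = Tuple[int, float]  # (size in bytes, modification time)


def poll_once(
    watch_dir: Path,
    patterns: Sequence[str],
    state: Dict[Path, Tuple[Snapshot, float]],
    settle_seconds: float,
    now: Optional[float] = None,
) -> List[Path]:
    """Return files whose size and mtime have been unchanged for settle_seconds."""
    if now is None:
        now = time.monotonic()
    ready = []
    for path in sorted(watch_dir.iterdir()):
        if not path.is_file():
            continue
        if not any(fnmatch.fnmatch(path.name, pattern) for pattern in patterns):
            continue
        stat = path.stat()
        snapshot = (stat.st_size, stat.st_mtime)
        previous = state.get(path)
        if previous is None or previous[0] != snapshot:
            # New or still-changing file: restart its settle timer.
            state[path] = (snapshot, now)
        elif now - previous[1] >= settle_seconds:
            ready.append(path)
    return ready
```

A caller would invoke this every poll_seconds and remember which paths it has already uploaded, since a settled file keeps appearing in the result.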
Authorization
Legacy single-token mode:
SCIFINDER_ROUTE_TOKEN=change-me
Multi-user token mode:
SCIFINDER_ROUTE_USERS=alice:viewer-token:viewer,bob:operator-token:operator,root:admin-token:admin
Roles:
- viewer: search/read/status
- operator: scan/reparse/retry/vector/evaluation/integration tests
- admin: config/backup/cleanup/secret operations
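The name:token:role format of SCIFINDER_ROUTE_USERS can be parsed along these lines. This sketch assumes the roles are hierarchical (admin includes operator, operator includes viewer), which the README does not state explicitly; `parse_users`, `is_allowed`, and `ROLE_RANK` are hypothetical names.

```python
from typing import Dict

# Assumed ordering: a higher rank includes the permissions of lower ranks.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def parse_users(spec: str) -> Dict[str, Dict[str, str]]:
    """Parse 'name:token:role,...' into a mapping keyed by token."""
    users: Dict[str, Dict[str, str]] = {}
    for entry in filter(None, (part.strip() for part in spec.split(","))):
        fields = entry.split(":")
        if len(fields) != 3:
            raise ValueError(f"expected name:token:role, got {entry!r}")
        name, token, role = fields
        if role not in ROLE_RANK:
            raise ValueError(f"unknown role {role!r} for user {name!r}")
        users[token] = {"name": name, "role": role}
    return users


def is_allowed(users: Dict[str, Dict[str, str]], token: str, required_role: str) -> bool:
    """True when the token is known and its role ranks at least required_role."""
    user = users.get(token)
    return user is not None and ROLE_RANK[user["role"]] >= ROLE_RANK[required_role]
```

Note that this format implies tokens cannot contain a colon or a comma.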
Development
python -m pytest -q
Optional Docker check:
docker compose build
docker compose -f docker-compose.image.yml config