scifinder-route-mcp

MCP server for indexing and searching reaction-step-level synthesis routes from local SciFinder exports, designed for Docker/NAS deployment.


NAS-hosted MCP server for indexing and searching reaction-step-level synthesis routes from local SciFinder exports. It is designed to run long-term on Docker/NAS with a read-only inbox, durable SQLite queue fallback, optional external OCR/LLM/vector/parser/structure-recognition APIs, and an operational Admin Web UI.

GHCR visibility note: if anonymous pull fails, open GitHub → Packages → scifinder-route-mcp → Package settings → Change visibility → Public. The compose file is already configured for ghcr.io/kettly1260/scifinder-route-mcp:latest.

Quick Start With Prebuilt Image

The published Docker image targets both linux/amd64 and linux/arm64.

git clone https://github.com/kettly1260/scifinder-route-mcp.git
cd scifinder-route-mcp
cp .env.example .env
mkdir -p nas-data nas-inbox
docker compose -f docker-compose.image.yml up -d

Then open:

Admin Web UI: http://<nas-host>:8001/
MCP SSE:      http://<nas-host>:8000/sse

Put SciFinder exports into nas-inbox, then click Scan Inbox in the Admin Web UI or call the MCP scan_inbox tool. docker-compose.image.yml references the published image through image: only, so nothing is built locally.

Local Build Deployment

docker compose up -d --build

Persistent paths:

./nas-data  -> /data
./nas-inbox -> /inbox (read-only in the container)
./nas-data/uploads -> /data/uploads (HTTP upload and sidecar staging)

Parsing is asynchronous in the NAS profile. Jobs are stored durably in SQLite; after a container restart, interrupted running jobs are re-queued. Poll get_parse_job_status or list_parse_jobs until completion.
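The polling loop described above can be sketched as follows. This is a minimal illustration, not the server's client code: fetch_status is a hypothetical callable that wraps a get_parse_job_status call and returns a dict with a status key, and the terminal state names ("completed", "failed") are assumptions, since the actual status strings are not listed here.

```python
import time
from typing import Callable, Dict

# Assumed terminal states; the real status strings are not documented here.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll fetch_status(job_id) until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r}")
        sleep(interval_s)
```

Because re-queued jobs pass through non-terminal states again after a container restart, a loop like this keeps waiting across restarts as long as the job ID stays valid.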

Environment and Runtime Config

Copy .env.example to .env. Docker-level settings such as published ports, volumes, container network, and restart policy belong in .env/Compose only. The Admin Web UI never edits Docker files and never controls host Docker.

Hot application config is read from /data/config.yaml; copy config.example.yaml to ./nas-data/config.yaml if desired. Hot-reloadable sections include:

server.async_jobs, server.max_workers, server.storage_backend
queue.backend, queue.redis_url
security.allow_external_paths, security.token, security.users
ingest.scan_extensions
integrations.*
extraction.llm_schema_version, extraction.llm_prompt_profile, extraction.llm_cost_limit_usd
thresholds.verification_confidence_threshold
retention.evidence_retention_days, retention.cache_retention_days

Use MCP tools get_config, update_config, validate_config, and reload_config, or use the Admin Web UI.
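Before sending a patch through update_config, it may help to check which keys fall outside the hot-reloadable list. The sketch below is an illustration only: the key list is copied from this README (which says the sections "include" these keys, so it may not be exhaustive), the nested-dict patch shape is an assumption, and server.port in the usage example is a made-up key.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Dotted keys listed in this README as hot-reloadable.
HOT_RELOADABLE = [
    "server.async_jobs", "server.max_workers", "server.storage_backend",
    "queue.backend", "queue.redis_url",
    "security.allow_external_paths", "security.token", "security.users",
    "ingest.scan_extensions",
    "integrations.*",
    "extraction.llm_schema_version", "extraction.llm_prompt_profile",
    "extraction.llm_cost_limit_usd",
    "thresholds.verification_confidence_threshold",
    "retention.evidence_retention_days", "retention.cache_retention_days",
]


def flatten(config: Dict, prefix: str = "") -> Dict[str, object]:
    """Turn nested config sections into dotted keys, e.g. server.max_workers."""
    flat: Dict[str, object] = {}
    for key, value in config.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


def not_hot_reloadable(patch: Dict) -> List[str]:
    """Return the dotted keys in patch that match no entry in HOT_RELOADABLE."""
    return sorted(
        key for key in flatten(patch)
        if not any(fnmatchcase(key, pattern) for pattern in HOT_RELOADABLE)
    )
```

For example, not_hot_reloadable({"server": {"max_workers": 4, "port": 9000}}) flags only the hypothetical server.port key.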

Admin Web UI

The Admin Web UI provides operational controls for:

- health/status cards and mounted storage diagnostics
- token-protected config changes
- queue status, recent jobs, failed-job retry
- HTTP upload endpoint for sidecar/client upload
- LLM endpoint/model/enable toggle, schema version, prompt profile, cost limit
- embedding endpoint/model, vector rebuild, vector index status and errors
- OCR endpoint/model, OCR backlog status
- document parser endpoint/model, parser fallback and endpoint health
- structure recognition endpoint/model health
- PostgreSQL URL/backend status with SQLite fallback
- DOI low-confidence queue count
- evaluation latest metrics
- SQLite backup, retention dry-run cleanup, NAS storage usage
- compound registry count and search via MCP

MCP Tools

Implemented tools:

health_check
get_config
update_config
validate_config
reload_config
scan_inbox
register_document
upload_document
get_parse_job_status
list_parse_jobs
retry_parse_job
retry_failed_jobs
search_reaction_steps
get_reaction_step
get_reaction_provenance
record_doi_verification
reparse_document
export_evaluation_set
compute_evaluation_metrics
get_evaluation_status
rebuild_vector_index
get_vector_index_status
semantic_search_reaction_steps
search_compounds
get_compound
merge_compounds
search_by_smiles
recognize_structure_image
backup_database
get_storage_usage
cleanup_evidence_cache
test_integration_endpoint

Feature Matrix

| Area | Status | Notes |
| --- | --- | --- |
| Docker/NAS SSE service | Implemented | Compose and prebuilt-image compose supported. |
| GHCR multi-arch image workflow | Implemented | linux/amd64, linux/arm64. GHCR package visibility may need manual public setting. |
| Read-only inbox scanning | Implemented | /inbox mounted read-only. |
| HTTP upload staging | Implemented | POST /api/upload writes to /data/uploads; hash dedupe supported. |
| Sidecar watcher | Implemented | scifinder-route-sidecar polling CLI uploads stable files. |
| Durable queue | Implemented | SQLite queue is default; restart recovery and retry tools. Redis is optional/degraded via config status, not required. |
| SQLite storage | Implemented | Source documents, jobs, reaction steps, provenance, DOI verification, vector rows, compounds, metrics. |
| PostgreSQL backend | Runnable degraded integration | SCIFINDER_ROUTE_BACKEND=postgres tests connectivity and reports status; SQLite remains active fallback unless a Postgres adapter is added for a deployment. |
| pgvector | Optional/degraded | SQLite stores embeddings as JSON and cosine-searches them; Postgres/pgvector reports endpoint/backend status. |
| PDF/HTML/MHTML/text parsing | Implemented | Built-in parser remains fallback. |
| External document parser | Implemented | /parse JSON adapter; failure falls back unless disabled. |
| OCR worker | Implemented adapter | /ocr JSON adapter for image-only PDFs/low-text docs; errors are job errors, not service crashes. |
| Rule extraction | Implemented | Candidate blocks and structured fields. |
| LLM JSON structuring | Implemented adapter | OpenAI-compatible /chat/completions; strict JSON; invalid responses fall back to rule fields with metadata error. |
| Embedding/vector index | Implemented adapter | OpenAI-compatible /embeddings; rebuild/status/semantic search tools. |
| Compound registry | Implemented | CAS/SMILES/InChIKey text extraction, alias registry, reaction roles; RDKit optional. |
| Image structure recognition | Implemented adapter | /recognize adapter creates low-confidence image candidates; does not overwrite text evidence. |
| Multi-user authorization | Implemented | viewer, operator, admin roles via SCIFINDER_ROUTE_USERS or config users. Legacy single token maps to admin. |
| Evaluation metrics | Implemented | JSONL gold-set metrics and latest metric status. |
| Backup/retention | Implemented | SQLite backup, storage usage, evidence/cache cleanup dry-run. |
| Endpoint health checks | Implemented | LLM, embedding, OCR, parser, structure recognition, Postgres. |

External API Schemas

All external services are optional. If a service is not configured or fails, the server returns a degraded/skipped/error status instead of crashing the process.
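The "never crash, always report a status" behaviour can be sketched with a small wrapper. This is an assumption-laden illustration rather than the server's code: call_optional_service and the reason field are hypothetical names, and only the status values skipped and error are taken from the text above.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, Optional


def call_optional_service(endpoint: Optional[str], path: str, payload: Dict,
                          timeout_s: float = 10.0) -> Dict:
    """POST JSON to an optional service; never raise, always return a status dict."""
    if not endpoint:
        return {"status": "skipped", "reason": "endpoint not configured"}
    try:
        request = urllib.request.Request(
            endpoint.rstrip("/") + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return {"status": "ok", "result": json.load(response)}
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Covers connection failures, HTTP errors, bad URLs, and invalid JSON.
        return {"status": "error", "reason": str(exc)}
```

A caller would then branch on the returned status, for example recording an error on the parse job instead of letting the exception propagate.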

Embedding endpoint: POST <endpoint>/embeddings

{"model":"bge-m3","input":["text"]}

Expected response can be OpenAI-like:

{"data":[{"embedding":[0.1,0.2]}]}
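To illustrate how a response of this shape could feed the SQLite fallback search (embeddings stored as JSON and compared by cosine similarity), here is a minimal sketch. The function names and the (step_id, embedding_json) row layout are assumptions for illustration; the actual table schema is not shown in this README.

```python
import json
import math
from typing import List, Sequence, Tuple


def parse_embeddings(response_body: str) -> List[List[float]]:
    """Extract vectors from an OpenAI-style embeddings response."""
    return [item["embedding"] for item in json.loads(response_body)["data"]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], rows: List[Tuple[str, str]],
          k: int = 5) -> List[Tuple[float, str]]:
    """Rank (step_id, embedding_json) rows by similarity to the query vector."""
    scored = [(cosine(query, json.loads(blob)), step_id) for step_id, blob in rows]
    return sorted(scored, reverse=True)[:k]
```

A brute-force scan like this is linear in the number of stored vectors, which is usually acceptable for a single-user NAS collection.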

LLM endpoint: POST <endpoint>/chat/completions, OpenAI-compatible. The assistant content must be strict JSON with reaction-step fields.
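The strict-JSON requirement, together with the fallback to rule-extracted fields noted in the feature matrix, might look roughly like this. The field names in REQUIRED_FIELDS and the metadata keys are hypothetical, since the reaction-step schema is not given here; only the choices[0].message.content path follows the OpenAI-compatible response format.

```python
import json
from typing import Dict

# Hypothetical required fields; the real reaction-step schema is not shown here.
REQUIRED_FIELDS = ("reactants", "products", "conditions")


def structure_step(chat_response: Dict, rule_fields: Dict) -> Dict:
    """Use the LLM's JSON if it parses and has the required keys, else rule fields."""
    try:
        content = chat_response["choices"][0]["message"]["content"]
        parsed = json.loads(content)
        if not isinstance(parsed, dict):
            raise ValueError("assistant content is not a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in parsed]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return {**parsed, "metadata": {"source": "llm"}}
    except (KeyError, IndexError, TypeError, ValueError) as exc:
        # Invalid or incomplete LLM output: keep rule fields, record the error.
        return {**rule_fields, "metadata": {"source": "rules", "llm_error": str(exc)}}
```

Anything other than a bare JSON object (for instance, JSON wrapped in explanatory prose) is treated as invalid, which is what "strict" is taken to mean in this sketch.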

OCR endpoint: POST <endpoint>/ocr

{"model":"mineru-layout","file_path":"/data/uploads/file.pdf"}

Expected response:

{"text":"OCR text", "confidence":0.85}

Document parser endpoint: POST <endpoint>/parse

{"model":"parser-name","file_path":"/data/uploads/file.pdf"}

Expected response:

{"file_type":"pdf","title":"...","doi":"10....","chunks":[{"text":"...","page_number":1,"parser_name":"external","parser_version":"1"}]}

Structure recognition endpoint: POST <endpoint>/recognize

{"model":"decimer","image_path":"/data/evidence/page1.png"}

Expected response:

{"structures":[{"smiles":"CCO","confidence":0.7}]}
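According to the feature matrix, recognized structures become low-confidence image candidates and do not overwrite text evidence. One way to express that rule is sketched below; the compound record layout (smiles, confidence, source, needs_review) is an assumption made for illustration.

```python
from typing import Dict, List


def merge_structure_candidates(text_compounds: List[Dict],
                               recognized: Dict) -> List[Dict]:
    """Append image-derived structures without touching text-derived ones."""
    merged = [dict(compound) for compound in text_compounds]
    seen = {compound["smiles"] for compound in text_compounds}
    for structure in recognized.get("structures", []):
        smiles = structure["smiles"]
        if smiles in seen:
            continue  # text evidence (or an earlier candidate) already covers it
        merged.append({
            "smiles": smiles,
            "confidence": structure.get("confidence", 0.0),
            "source": "image",
            "needs_review": True,
        })
        seen.add(smiles)
    return merged
```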

Sidecar Watcher

Create sidecar.yaml on a client machine:

watch_dir: /path/to/scifinder/exports
server_url: http://nas-host:8001
token: change-me
include_patterns:
  - "*.pdf"
  - "*.html"
settle_seconds: 3
upload_mode: http
poll_seconds: 2

Run:

scifinder-route-sidecar sidecar.yaml

The sidecar polls by default and does not require watchdog, making it suitable for Windows/macOS/Linux clients.
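The interaction between poll_seconds, settle_seconds, and include_patterns can be sketched as a pure polling check. This is not the sidecar's actual implementation: stable_files and its seen dictionary are hypothetical, and "stable" is assumed to mean that a file's size and modification time have not changed for at least settle_seconds.

```python
import fnmatch
import os
import time
from typing import Dict, List, Optional, Sequence, Tuple

Signature = Tuple[int, float]  # (size in bytes, modification time)


def stable_files(watch_dir: str, include_patterns: Sequence[str],
                 seen: Dict[str, Tuple[Signature, float]],
                 settle_seconds: float, now: Optional[float] = None) -> List[str]:
    """Return matching files whose signature has been unchanged for settle_seconds.

    seen maps path -> (signature, time that signature was first observed) and
    is updated in place, so pass the same dict on every poll.
    """
    now = time.time() if now is None else now
    ready = []
    for name in sorted(os.listdir(watch_dir)):
        if not any(fnmatch.fnmatch(name, pattern) for pattern in include_patterns):
            continue
        path = os.path.join(watch_dir, name)
        if not os.path.isfile(path):
            continue
        stat = os.stat(path)
        signature = (stat.st_size, stat.st_mtime)
        previous = seen.get(path)
        if previous is None or previous[0] != signature:
            seen[path] = (signature, now)  # new or still changing: restart the clock
        elif now - previous[1] >= settle_seconds:
            ready.append(path)
    return ready
```

Calling this every poll_seconds and uploading whatever it returns avoids sending a PDF that the browser is still writing. A real watcher would also need to remember which files it has already uploaded.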

Authorization

Legacy single-token mode:

SCIFINDER_ROUTE_TOKEN=change-me

Multi-user token mode:

SCIFINDER_ROUTE_USERS=alice:viewer-token:viewer,bob:operator-token:operator,root:admin-token:admin

Roles:

viewer   search/read/status
operator scan/reparse/retry/vector/evaluation/integration tests
admin    config/backup/cleanup/secret operations
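A sketch of how the SCIFINDER_ROUTE_USERS format and the three roles could be checked follows. Two things are assumed rather than documented: that roles are cumulative (operator includes viewer permissions, admin includes both), and that tokens contain no colon or comma. The function names are hypothetical.

```python
from typing import Dict, Optional

# Assumed ordering: each role also has the permissions of the roles before it.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def parse_users(spec: str) -> Dict[str, Dict[str, str]]:
    """Parse 'name:token:role,...' into {token: {"name": ..., "role": ...}}."""
    users: Dict[str, Dict[str, str]] = {}
    for entry in filter(None, (part.strip() for part in spec.split(","))):
        name, token, role = entry.split(":")
        if role not in ROLE_RANK:
            raise ValueError(f"unknown role {role!r} for user {name!r}")
        users[token] = {"name": name, "role": role}
    return users


def is_allowed(users: Dict[str, Dict[str, str]], token: Optional[str],
               required_role: str) -> bool:
    """True if the token is known and its role ranks at or above required_role."""
    user = users.get(token)
    return user is not None and ROLE_RANK[user["role"]] >= ROLE_RANK[required_role]
```

Under these assumptions, legacy single-token mode is equivalent to a one-entry table whose token maps to the admin role.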

Development

python -m pytest -q

Optional Docker check:

docker compose build
docker compose -f docker-compose.image.yml config
