data-aggregator-mcp

Searches and fetches research datasets across Zenodo, DataCite (Dryad/Figshare/Dataverse/OSF), NCBI omics archives (GEO/SRA/BioProject), and the literature (PubMed/OpenAIRE) through one normalized model: it deduplicates by DOI, expands organism queries with NCBI Taxonomy synonyms, and bridges papers to the datasets they produced. It also resolves citations and open-access full text, and downloads files.

🔎 data-aggregator-mcp

One MCP server to find and fetch research data across archives, omics registries, and literature, behind a single normalized model.

search runs one query across Zenodo, DataCite (Dryad / Figshare / Dataverse / OSF / Mendeley), NCBI omics (GEO / SRA / BioProject), and literature (PubMed / OpenAIRE), and returns deduplicated, normalized, cross-linked records. resolve expands any hit into its file manifest, citation, and the data it points at. fetch downloads it to disk, verifying the checksum where the source exposes one.

mcp-name: io.github.musharna/data-aggregator-mcp

<p align="center"> <img src="examples/assets/demo.svg" alt="data-aggregator-mcp stdio demo: initialize, tools/list (search, resolve, fetch, list_sources), and a live list_sources call showing the four wired sources" width="820"> </p>

✨ Why this

Most data MCPs wrap a single source. This one unifies them behind four tools and one DataResource model, so an agent searches once and gets back comparable records:

  • Multi-domain, one model: generalist archives + raw omics + literature, deduplicated by DOI (the fetchable record wins over bare metadata).
  • Taxonomy synonym expansion: organism="Orobanche aegyptiaca" also matches Phelipanche aegyptiaca (NCBI Taxonomy), so a species rename doesn't cost you results.
  • Paper → data bridge: resolve a paper and get links to the GEO / SRA / BioProject / DataCite records it produced.
  • Verified fetch: streams to disk with md5 verification where the source exposes a checksum, optional archive unpacking, and a fail-loud integrity sniff that rejects an HTML paywall page served as a "PDF".
  • Citations, access & full text: render a citation in any CSL style, get normalized access/license, and pull open-access full text, all in one resolve.
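
The "fetchable record wins over bare metadata" rule can be illustrated with a minimal sketch. This is not the server's actual code: the `Record` class, its fields, and both helper names are hypothetical, and the real DataResource model is richer.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a DataResource; field names are assumed."""
    id: str
    doi: Optional[str]
    fetchable: bool


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common URL / scheme prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def dedupe_by_doi(records: Iterable[Record]) -> List[Record]:
    """Keep one record per DOI, preferring a fetchable one.

    Records without a DOI are keyed by their own id, so they are never merged.
    First-seen order is preserved.
    """
    best = {}
    order = []
    for rec in records:
        key = normalize_doi(rec.doi) if rec.doi else "id:" + rec.id
        if key not in best:
            best[key] = rec
            order.append(key)
        elif rec.fetchable and not best[key].fetchable:
            # A fetchable record replaces a metadata-only duplicate.
            best[key] = rec
    return [best[key] for key in order]
```

Under these assumptions, a DataCite metadata record and a Zenodo record that share a DOI collapse into the Zenodo one, because only the latter can be fetched.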

⚑ Quickstart

Run with no install:

uvx data-aggregator-mcp

Register with Claude Code:

claude mcp add data-aggregator -- uvx data-aggregator-mcp

A typical agent flow:

search("drought stress RNA-seq", organism="Sorghum bicolor")
  → [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ]   # deduped, taxa-normalized

resolve("sra:SRX079566")
  → DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }

fetch("sra:SRX079566", dest="./data")
  → ["./data/SRX079566_1.fastq.gz", …]                   # md5-verified

<details> <summary>Other ways to run (pip, python -m, raw client config)</summary>

pip install data-aggregator-mcp
data-aggregator-mcp        # or: python -m data_aggregator_mcp

Add to a client's MCP config (e.g. Claude Desktop claude_desktop_config.json):

{
  "mcpServers": {
    "data-aggregator": {
      "command": "uvx",
      "args": ["data-aggregator-mcp"],
      "env": { "NCBI_API_KEY": "your-optional-key" }
    }
  }
}

</details>

🗂️ Sources

| Source | Discover | Fetch | Checksum |
| --- | --- | --- | --- |
| Zenodo | ✅ | ✅ | md5 |
| DataCite → Figshare | ✅ | ✅ | md5 |
| DataCite → Dataverse | ✅ | ✅ | md5 |
| DataCite → OSF | ✅ | ✅ | md5 |
| DataCite → Dryad | ✅ | manifest only¹ | sha-256 (listed) |
| DataCite → Mendeley & others | ✅ | - | - |
| NCBI SRA | ✅ | ✅ (ENA FASTQ) | md5 |
| NCBI GEO | ✅ | ✅ (suppl/) | none² |
| NCBI BioProject | ✅ | → SRA links | - |
| PubMed / OpenAIRE | ✅ | ✅ (OA full text) | none² |

¹ Dryad downloads are token / bot-challenge gated, so fetch fails loud; resolve still lists the files.
² No upstream checksum: fetch verifies content-type instead (rejects an HTML page served in place of a binary).
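
For sources without an upstream checksum, a check along the following lines could catch an HTML page served in place of a binary. This is a sketch under assumptions, not the server's actual check: it inspects the leading bytes of the payload, and `IntegrityError`, `looks_like_html`, and `check_payload` are hypothetical names.

```python
class IntegrityError(Exception):
    """Hypothetical error for a payload that is not what was declared."""


_UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_html(head: bytes) -> bool:
    """Return True if the first bytes of a payload look like an HTML document."""
    probe = head[:512]
    if probe.startswith(_UTF8_BOM):
        probe = probe[len(_UTF8_BOM):]
    probe = probe.lstrip().lower()
    return probe.startswith((b"<!doctype html", b"<html"))


def check_payload(head: bytes, expect_pdf: bool = False) -> None:
    """Fail loud when a declared binary is really an HTML page.

    With expect_pdf=True the payload must also start with the PDF magic bytes.
    """
    if looks_like_html(head):
        raise IntegrityError("server returned an HTML page instead of the declared binary")
    if expect_pdf and not head.startswith(b"%PDF-"):
        raise IntegrityError("payload does not start with the PDF magic bytes")
```

Checking the payload itself, not only the response header, also covers servers that label a login or paywall page as application/pdf.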

🛠️ Tools

search(query, size?, sources?, organism?)

Fans out across all wired sources in parallel and returns compact DataResource records, deduped by DOI. Per-source failures land in errors{} and are never silently dropped.

  • organism: expand the query with NCBI Taxonomy synonyms; the expansion is echoed in taxon_expansion, and results carry normalized taxa[] ({taxid, name}) plus a described_in link to plant-genomics-mcp for plant taxa.
  • sources: restrict the fan-out, e.g. ["omics"].
  • size β€” max results (1–50).
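
The parallel fan-out with per-source error capture might look roughly like this. It is a minimal sketch, assuming each source is an async callable that takes the query and returns a list of records; `fan_out` and the shape of its return value are illustrative, not the server's API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

SourceFn = Callable[[str], Awaitable[List[Any]]]


async def fan_out(query: str, sources: Dict[str, SourceFn]) -> Dict[str, Any]:
    """Query every source concurrently; collect failures instead of raising."""
    names = list(sources)
    outcomes = await asyncio.gather(
        *(sources[name](query) for name in names),
        return_exceptions=True,  # one failing source must not sink the others
    )
    results: List[Any] = []
    errors: Dict[str, str] = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = f"{type(outcome).__name__}: {outcome}"
        else:
            results.extend(outcome)
    return {"results": results, "errors": errors}
```

With `return_exceptions=True`, a timeout in one archive shows up as an entry in `errors` while the hits from the remaining sources are still returned.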

resolve(id)

Returns the full record plus its files manifest. Routes by id shape: zenodo:7654321, a bare DOI, datacite:10.5061/dryad.x, an omics id (sra:SRX079566, geo:GSE332789, bioproject:PRJNA1468572), or a literature id (pubmed:34320281, openaire:<id>). Attaches, where available:

  • files[]: ENA FASTQ manifest (SRA), GEO suppl/, or the host repo's native manifest (Figshare / Dataverse / OSF / Dryad).
  • links[]: paper-to-data links, pubmed: → sra: / geo: / bioproject: (via NCBI elink) and openaire: → datacite: (via ScholeXplorer Scholix).
  • access / license: normalized status (open / embargoed / restricted / closed / unknown) and license where the source exposes it.
  • identifiers: normalized {pmid, pmcid, doi}, plus an open-access full-text FileEntry (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
  • citation: pass cite=<format>: bibtex, ris, csl-json, or any CSL style name (apa, mla, vancouver, …). DOI records use content negotiation; others render CSL-JSON from metadata. Off by default; failures degrade quietly.
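
Routing by id shape could be sketched as below. The prefix list comes from the examples above; the function name, the "doi" route label for bare DOIs, and the DOI pattern are assumptions made for illustration.

```python
import re
from typing import Tuple

# Prefixes taken from the id examples in this README.
KNOWN_PREFIXES = ("zenodo", "datacite", "sra", "geo", "bioproject", "pubmed", "openaire")

# Loose pattern for a bare DOI: "10.", a registrant code, "/", then a suffix.
BARE_DOI = re.compile(r"^10\.\d{4,9}/\S+$")


def route(identifier: str) -> Tuple[str, str]:
    """Split an id into (route, local_id), or raise ValueError if unrecognized."""
    ident = identifier.strip()
    prefix, sep, rest = ident.partition(":")
    if sep and rest and prefix.lower() in KNOWN_PREFIXES:
        return prefix.lower(), rest
    if BARE_DOI.match(ident):
        # Hypothetical route label; the README does not say where bare DOIs go.
        return "doi", ident
    raise ValueError(f"unrecognized id shape: {identifier!r}")
```

Checking known prefixes before the DOI pattern keeps a DOI whose suffix contains a colon from being mistaken for a prefixed id.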

fetch(id, dest?, files?, max_bytes?, force?, extract?)

Download files to disk and return their paths. Streams under a max_bytes guard (force to override) with md5 verification wherever a checksum exists.

  • files: restrict to a subset of the resolved manifest.
  • extract: unpack downloaded zip / tar archives in place, guarded against path traversal and runaway extracted size. Off by default.
  • Unverified fetches (GEO suppl/, literature full text) get a content-type sniff that fails loud if a declared binary is actually an HTML page.
  • Fetchable: Zenodo, SRA, GEO, DataCite-hosted Figshare / Dataverse / OSF, and literature open-access full text. Dryad and other DataCite repos are discovery-only and raise FetchNotSupportedError.
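
Streaming under a size guard while verifying md5 can be done in a single pass. The sketch below is illustrative only: it takes any iterable of byte chunks (for example an HTTP response body), and the function and exception names are hypothetical.

```python
import hashlib
import os
from typing import Iterable, Optional


class TooLargeError(Exception):
    """Hypothetical error: the download exceeded max_bytes without force."""


class ChecksumMismatchError(Exception):
    """Hypothetical error: the computed md5 differs from the expected one."""


def stream_to_disk(
    chunks: Iterable[bytes],
    dest_path: str,
    max_bytes: int,
    expected_md5: Optional[str] = None,
    force: bool = False,
) -> int:
    """Write chunks to dest_path, hashing as they arrive; return bytes written.

    The partial file is removed if the size guard or the checksum fails.
    """
    digest = hashlib.md5()
    written = 0
    try:
        with open(dest_path, "wb") as out:
            for chunk in chunks:
                written += len(chunk)
                if not force and written > max_bytes:
                    raise TooLargeError(f"download exceeds {max_bytes} bytes")
                digest.update(chunk)
                out.write(chunk)
        if expected_md5 is not None and digest.hexdigest() != expected_md5.lower():
            raise ChecksumMismatchError(f"md5 mismatch for {dest_path}")
    except Exception:
        if os.path.exists(dest_path):
            os.remove(dest_path)
        raise
    return written
```

Hashing while streaming avoids reading the file a second time, and the guard trips as soon as the limit is crossed, before the rest of the body is downloaded.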

list_sources()

Wired sources with their capabilities: layer, kinds, supported filters, fetchability, id examples, auth, and rate limits.

⚙️ Configuration

Both optional, set via environment variables:

  • NCBI_API_KEY: raises the NCBI E-utilities rate limit (3 → 10 req/s) used by the omics, literature, and taxonomy lookups.
  • UNPAYWALL_EMAIL: enables the Unpaywall fallback leg of literature full-text retrieval (the EuropePMC leg works without it).
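
The effect of the two variables can be summarized in a small sketch. The environment variable names and the 3 and 10 req/s figures come from the list above; `load_config` and the keys of the dictionary it returns are hypothetical.

```python
import os
from typing import Any, Dict, Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Derive runtime settings from the two optional environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("NCBI_API_KEY") or None      # empty string counts as unset
    email = env.get("UNPAYWALL_EMAIL") or None
    return {
        "ncbi_api_key": api_key,
        # E-utilities allow 3 req/s without a key and 10 req/s with one.
        "ncbi_requests_per_second": 10 if api_key else 3,
        "unpaywall_email": email,
        "unpaywall_enabled": email is not None,
    }
```

Passing a plain dictionary instead of reading `os.environ` makes the behavior easy to check in tests.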

🧪 Develop

uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q   # real-API probes

The README demo (examples/assets/demo.svg) is recorded network-free from examples/_demo_stdio.py; see the header of that file to re-record.

License

MIT. See LICENSE.
