
🔭 EOSC Data Commons Search server


A server for the EOSC Data Commons project's MatchMaker service, providing natural language search over open-access datasets. It exposes an HTTP POST endpoint and supports the Model Context Protocol (MCP) to help users discover datasets and tools via Large Language Model–assisted search.

🧩 Endpoints

The HTTP API comprises two main endpoints:

  • /mcp: MCP server that searches for relevant data to answer a user question using the EOSC Data Commons OpenSearch service
    • Uses Streamable HTTP transport
    • Available tools:
      • [x] Search datasets
      • [x] Get metadata for the files in a dataset (name, description, type of files)
      • [ ] Search tools
      • [ ] Search citations related to datasets or tools
  • /chat: HTTP POST endpoint (JSON) for chatting with the MCP server tools via an LLM provider (the provider API key is supplied through an environment variable at deployment)
    • Streams a Server-Sent Events (SSE) response complying with the AG-UI protocol.
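
As a concrete example, here is a minimal sketch of connecting to the /mcp endpoint with the official MCP Python SDK and calling a search tool. The tool name and argument shape below are assumptions for illustration; list the tools first to discover the real interface.

```python
# Minimal sketch using the official `mcp` Python SDK (Streamable HTTP transport).
# The tool name "search_datasets" and its arguments are assumptions; use the
# output of list_tools() to find the actual tool names and schemas.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main() -> None:
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "search_datasets",  # hypothetical tool name
                {"question": "marine biodiversity datasets"},
            )
            print(result.content)


asyncio.run(main())
```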

[!TIP]

It can also be used as a standalone MCP server via the pip package.

🔌 Connect client to MCP server

The system can be used directly as an MCP server using either STDIO or Streamable HTTP transport.

[!WARNING]

You will need access to a pre-indexed OpenSearch instance for the MCP server to work.

Follow your client's instructions, and use the /mcp URL of your deployed server (e.g. http://localhost:8000/mcp).

To add a new MCP server to VSCode GitHub Copilot:

  • Open the Command Palette (ctrl+shift+p or cmd+shift+p)
  • Search for MCP: Add Server...
  • Choose HTTP, and provide the MCP server URL http://localhost:8000/mcp

Your VSCode mcp.json should look like:

{
    "servers": {
        "data-commons-search-http": {
            "url": "http://localhost:8000/mcp",
            "type": "http"
        }
    },
    "inputs": []
}

Or with STDIO transport:

{
   "servers": {
      "data-commons-search": {
         "type": "stdio",
         "command": "uvx",
         "args": ["data-commons-search"],
         "env": {
            "OPENSEARCH_URL": "OPENSEARCH_URL"
         }
      }
   }
}

Or using local folder for development:

{
   "servers": {
      "data-commons-search": {
         "type": "stdio",
         "cwd": "~/dev/data-commons-search",
         "env": {
            "OPENSEARCH_URL": "OPENSEARCH_URL"
         },
         "command": "uv",
         "args": ["run", "data-commons-search"]
      }
   }
}

🛠️ Development

[!IMPORTANT]

Requirements:

  • [x] uv, to easily handle scripts and virtual environments
  • [x] docker, to deploy the OpenSearch service (or just access to a running instance)
  • [x] API key for an LLM provider: e-infra CZ, Mistral.ai, or OpenRouter

📥 Install dev dependencies

uv sync --extra agent

Install pre-commit hooks:

uv run pre-commit install

Create a keys.env file with your LLM provider API key(s):

EINFRACZ_API_KEY=YOUR_API_KEY
MISTRAL_API_KEY=YOUR_API_KEY
OPENROUTER_API_KEY=YOUR_API_KEY

⚡️ Start dev server

Start the dev server at http://localhost:8000, with the MCP endpoint at http://localhost:8000/mcp:

uv run uvicorn src.data_commons_search.main:app --log-config logging.yml --reload

By default, OPENSEARCH_URL=http://localhost:9200.

Customize server configuration through environment variables:

SERVER_PORT=8001 OPENSEARCH_URL=http://localhost:9200 uv run uvicorn src.data_commons_search.main:app --host 0.0.0.0 --port 8001 --log-config logging.yml --reload

[!TIP]

Example curl request:

curl -X POST http://localhost:8000/chat \
	-H "Content-Type: application/json" -H "Authorization: SECRET_KEY" \
	-d '{"messages": [{"role": "user", "content": "Educational datasets from Switzerland covering student assessments, language competencies, and learning outcomes, including experimental or longitudinal studies on pupils or students."}], "model": "einfracz/qwen3-coder"}'

Recommended model per supported provider:

  • einfracz/qwen3-coder or einfracz/gpt-oss-120b (smaller, faster)
  • mistralai/mistral-medium-latest (the large model is older and not as good at tool calls)
  • groq/moonshotai/kimi-k2-instruct
  • openai/gpt-4.1

[!IMPORTANT]

To build and integrate the frontend web app into the server, run from the frontend folder:

npm run build && rm -rf ../data-commons-search/src/data_commons_search/webapp/ && cp -R dist/spa/ ../data-commons-search/src/data_commons_search/webapp/

📦 Build for production

Build the distribution packages in dist/:

uv build

🐳 Deploy with Docker

Create a keys.env file with the API keys:

EINFRACZ_API_KEY=YOUR_API_KEY
MISTRAL_API_KEY=YOUR_API_KEY
OPENROUTER_API_KEY=YOUR_API_KEY
SEARCH_API_KEY=SECRET_KEY_YOU_CAN_USE_IN_FRONTEND_TO_AVOID_SPAM

[!TIP]

SEARCH_API_KEY adds a layer of protection against bots that might spam the LLM. If it is not provided, no API key is needed to query the API.

You can use the prebuilt Docker image ghcr.io/eosc-data-commons/data-commons-search:main.

Example compose.yml:

services:
  mcp:
    image: ghcr.io/eosc-data-commons/data-commons-search:main
    ports:
      - "127.0.0.1:8000:8000"
    environment:
      OPENSEARCH_URL: "http://opensearch:9200"
      EINFRACZ_API_KEY: "${EINFRACZ_API_KEY}"

Build and deploy the service:

docker compose up
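
If you also need a local OpenSearch instance, the compose file can be extended along these lines. This is a sketch only: the OpenSearch image tag and security settings below are assumptions for local development, and you still need to index the EOSC Data Commons data yourself.

```yaml
services:
  mcp:
    image: ghcr.io/eosc-data-commons/data-commons-search:main
    ports:
      - "127.0.0.1:8000:8000"
    environment:
      OPENSEARCH_URL: "http://opensearch:9200"
      EINFRACZ_API_KEY: "${EINFRACZ_API_KEY}"
    depends_on:
      - opensearch
  opensearch:
    image: opensearchproject/opensearch:2
    environment:
      discovery.type: "single-node"
      DISABLE_SECURITY_PLUGIN: "true"  # dev only; keep security enabled in production
    volumes:
      - opensearch-data:/usr/share/opensearch/data

volumes:
  opensearch-data:
```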

[!IMPORTANT]

Deployment to the staging server is done automatically through GitHub Actions on each push to the main branch.

When a push is made, the workflow will:

  • Pull the main branch from the frontend repository
  • Build the frontend and add it to src/data_commons_search/webapp
  • Build the Docker image for the server
  • Publish the Docker image as main/latest

The staging infrastructure then automatically pulls the latest version of the image and deploys it.

✅ Run tests

[!CAUTION]

You need to first start the server on port 8001 (see the start dev server section).

uv run pytest

To display all logs when debugging:

uv run pytest -s
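
As a reference point, a minimal smoke test against the running dev server might look like this. It is a sketch, assuming the server is up on port 8001 as described above; the assertion is deliberately loose.

```python
# Hypothetical smoke test: checks that the dev server on port 8001 is up and
# routing the /mcp path. A GET without MCP headers is expected to be rejected
# with a 4xx, which still proves the endpoint exists; only 5xx would fail.
import httpx


def test_mcp_endpoint_is_routed() -> None:
    resp = httpx.get("http://localhost:8001/mcp")
    assert resp.status_code < 500
```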

🧹 Format code and type check

uvx ruff format
uvx ruff check --fix
uv run mypy

♻️ Reset the environment

Upgrade uv:

uv self update

Clean uv cache:

uv cache clean

🏷️ Release process

[!IMPORTANT]

Get a PyPI API token at pypi.org/manage/account.

Run the release script providing the version bump: fix, minor, or major

.github/release.sh fix

[!TIP]

Add your PyPI token to your environment, e.g. in ~/.zshrc or ~/.bashrc:

export UV_PUBLISH_TOKEN=YOUR_TOKEN

🤝 Acknowledgments

The LLM provider einfracz is a service provided by e-INFRA CZ and operated by CERIT-SC at Masaryk University.

Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
