OMOP MCP Server
Enables natural language exploration of OMOP CDM databases for concept discovery, patient count queries, and cohort SQL generation with support for multiple database backends.
README
OMOP MCP Server
A Model Context Protocol (MCP) server for intelligent OMOP Common Data Model (CDM) exploration, concept discovery, and cohort query generation.
π― What Can This MCP Do?
This MCP server enables AI assistants (Claude, custom agents) to work seamlessly with OMOP CDM databases through natural language. Below are real-world use cases to help you understand its capabilities.
π Use Cases
1. π Clinical Concept Discovery
Scenario: You need to find OMOP standard concepts for a clinical term.
What you can ask:
- "Find all OMOP concepts for Type 2 Diabetes"
- "Map ICD-10 code E11.9 to OMOP standard concepts"
- "What are the SNOMED codes for influenza?"
What happens:
- The MCP searches the ATHENA vocabulary service
- Returns standard concepts with IDs, names, vocabularies, and domains
- Shows relationships (parent concepts, mappings, child concepts)
- Filters by domain (Condition, Drug, Procedure) if specified
Example output:
{
"concepts": [
{
"concept_id": 201826,
"concept_name": "Type 2 diabetes mellitus",
"domain_id": "Condition",
"vocabulary_id": "SNOMED",
"standard_concept": "S"
}
],
"relationships": {...}
}
2. π Patient Count Queries
Scenario: You want to know how many patients have a specific condition in your database.
What you can ask:
- "How many patients have influenza?"
- "Count patients with Type 2 Diabetes"
- "How many people are on Metformin?"
What happens:
- MCP discovers the relevant OMOP concepts (e.g., influenza β concept IDs 4171852, 4171853)
- Generates and validates SQL query against your database
- Optionally executes the query and returns results
- Shows estimated query cost (for BigQuery)
Example workflow:
User: "How many patients with flu?"
β Step 1: discover_concepts("flu") β [4171852, 4171853]
β Step 2: query_omop(type="count", concept_ids=[...]) β {"patient_count": 1234}
3. 𧬠Demographic Breakdowns
Scenario: You need demographic analysis of patients with a condition.
What you can ask:
- "Show age and gender distribution of diabetic patients"
- "Break down flu patients by demographics"
- "What's the age distribution of patients on statins?"
What happens:
- MCP finds the relevant concepts
- Generates SQL joining person table for demographics
- Groups by gender and age
- Returns breakdown with patient counts
Example output:
{
"results": [
{"gender_concept_id": 8507, "age_years": 65, "patient_count": 145},
{"gender_concept_id": 8532, "age_years": 58, "patient_count": 132},
...
]
}
4. π Concept Relationship Exploration
Scenario: You need to explore concept hierarchies and mappings.
What you can ask:
- "Show me all child concepts under 'Diabetes Mellitus'"
- "What does ICD-10 E11.9 map to in SNOMED?"
- "Find parent concepts for Metformin 500mg"
What happens:
- MCP fetches relationships from ATHENA
- Filters by relationship type (Maps to, Subsumes, Is a)
- Returns hierarchical concept tree
- Shows vocabulary crosswalks (ICD-10 β SNOMED, etc.)
Use for:
- Building comprehensive concept sets
- Understanding vocabulary mappings
- Creating inclusion/exclusion criteria
5. π Cohort SQL Generation
Scenario: You need to define a research cohort with temporal logic.
What you can ask:
- "Generate SQL for patients on Metformin who developed acute kidney injury within 90 days"
- "Create a cohort of diabetics who had a stroke within 1 year"
- "Find patients with exposure X followed by outcome Y"
What happens:
- MCP uses concept IDs for exposure and outcome
- Generates SQL with temporal constraints
- Includes deduplication logic (first exposure per patient)
- Validates query and estimates cost
- Returns executable SQL for your platform (BigQuery or Postgres)
Example SQL output:
WITH exposure AS (
SELECT DISTINCT person_id, drug_exposure_start_date AS exposure_date
FROM drug_exposure
WHERE drug_concept_id IN (1503297) -- Metformin
),
outcome AS (
SELECT DISTINCT person_id, condition_start_date AS outcome_date
FROM condition_occurrence
WHERE condition_concept_id IN (46271022) -- Acute kidney injury
),
cohort AS (
SELECT e.person_id, e.exposure_date, o.outcome_date,
DATE_DIFF(o.outcome_date, e.exposure_date, DAY) AS days_to_outcome
FROM exposure e
INNER JOIN outcome o ON e.person_id = o.person_id
WHERE e.exposure_date <= o.outcome_date
AND DATE_DIFF(o.outcome_date, e.exposure_date, DAY) <= 90
)
SELECT * FROM cohort
QUALIFY ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY exposure_date) = 1;
6. π Cross-Vocabulary Mapping
Scenario: You have codes from multiple vocabularies and need to standardize them.
What you can ask:
- "Map these ICD-10 codes to SNOMED: E11.9, E10.9, I10"
- "Convert RxNorm codes to OMOP standard drug concepts"
- "What are the standard equivalents for these ICD-9 codes?"
What happens:
- MCP searches each code in its source vocabulary
- Follows "Maps to" relationships to standard concepts
- Returns both source and standard concepts
- Ensures all concepts are ready for cohort queries
Why it matters:
- Electronic Health Records use different coding systems
- OMOP requires standard concepts for queries
- One query captures data from all source vocabularies
7. π Multi-Backend Portability
Scenario: You need the same cohort query for different database platforms.
What you can ask:
- "Generate this cohort query for both BigQuery and Postgres"
- "Show me the cost difference between running this on BigQuery vs Postgres"
What happens:
- MCP generates dialect-specific SQL
- BigQuery version uses UNNEST, QUALIFY, backtick-quoted tables
- Postgres version uses arrays, subqueries, schema.table format
- Both return identical results from OMOP CDM data
Supported backends:
- β BigQuery (full support, cost estimates, dry-run validation)
- β Snowflake (full support, EXPLAIN validation, enterprise ready)
- β DuckDB (full support, local execution, zero setup, free!)
- β Universal SQL Translation (10+ dialects via SQLGlot)
8. π° Cost Estimation & Validation
Scenario: You want to check query cost before running expensive analytics.
What you can ask:
- "How much will it cost to query all diabetes patients with cardiovascular events?"
- "Estimate the cost before running this cohort query"
- "Validate this SQL without executing it"
What happens:
- MCP runs BigQuery dry-run validation
- Returns estimated bytes scanned
- Calculates approximate cost (BigQuery: $5/TB)
- Warns if cost exceeds configured threshold (default: $1)
- Requires confirmation for expensive queries
Safety features:
- π« Blocks queries over cost limit
- π Shows query plan details
- β±οΈ Estimates execution time
- π Prevents accidental expensive runs
9. π‘οΈ Secure, Governed Queries
Scenario: You need enterprise-grade security and audit trails.
What you can ask:
- "Run this query with my team's credentials"
- "Show me the audit log for executed queries"
- "Check if I have permission to list patient IDs"
What happens:
- MCP validates OAuth2.1 bearer token
- Checks user roles and permissions
- Blocks PHI-returning queries in production (e.g., patient lists)
- Logs all queries with user ID, timestamp, cost, results
- Enforces row limits (max 1000 rows)
- Blocks mutating queries (DELETE, UPDATE, DROP)
Security controls:
- π OAuth2.1 authentication
- π₯ Role-based authorization
- π Complete audit trail
- π« Mutation blocking
- π΅ Cost caps
- β° Query timeouts (30s default)
10. π§ͺ Exploratory Data Analysis
Scenario: You're exploring a new OMOP dataset and want to understand what's in it.
What you can ask:
- "What are the top 10 most common conditions in this database?"
- "Show me the drug exposure distribution"
- "What's the age range of patients in the dataset?"
- "How many patients have enrollment data?"
What happens:
- MCP generates exploratory SQL queries
- Runs aggregations across core OMOP tables
- Returns summary statistics
- Helps you understand data completeness
Great for:
- Data quality assessment
- Study feasibility analysis
- Understanding data coverage
- Identifying common vs. rare conditions
β¨ New Features
π€ AI-Powered Agents (PydanticAI)
Natural language interface for OMOP queries with intelligent concept discovery and SQL generation:
from omop_mcp.agents import ConceptDiscoveryAgent, SQLGenerationAgent
# AI-powered concept discovery
agent = ConceptDiscoveryAgent()
result = await agent.run("Find all diabetes medications")
# Returns structured list of drug concepts with confidence scores
# AI-powered SQL generation
sql_agent = SQLGenerationAgent()
cohort_sql = await sql_agent.run(
"Patients on metformin who developed kidney problems within 90 days"
)
# Returns complete, validated cohort SQL
ποΈ Multi-Database Support
Query OMOP CDM across different database platforms with automatic SQL translation:
DuckDB (Local Development):
from omop_mcp.backends import DuckDBBackend
# Zero setup - works immediately!
backend = DuckDBBackend() # In-memory by default
results = await backend.execute_query("SELECT COUNT(*) FROM person")
# Fast, free, local execution - perfect for development!
Snowflake (Enterprise):
from omop_mcp.backends import SnowflakeBackend
# Enterprise cloud data warehouse
backend = SnowflakeBackend()
parts = await backend.build_cohort_sql(
exposure_ids=[1234],
outcome_ids=[5678],
pre_outcome_days=30
)
# Production-ready with EXPLAIN validation
BigQuery (Cloud-Scale):
from omop_mcp.backends import BigQueryBackend
# Google Cloud Platform
backend = BigQueryBackend()
validation = await backend.validate_sql(sql) # Dry-run with cost estimate
π Universal SQL Translation
Translate queries between 10+ SQL dialects automatically:
from omop_mcp.backends import translate_sql
# Translate BigQuery SQL to Snowflake
bigquery_sql = "SELECT DATE_DIFF(end_date, start_date, DAY) FROM visits"
snowflake_sql = translate_sql(bigquery_sql, "bigquery", "snowflake")
# Result: "SELECT DATEDIFF(DAY, start_date, end_date) FROM visits"
# Translate to DuckDB
duckdb_sql = translate_sql(bigquery_sql, "bigquery", "duckdb")
# Result: "SELECT date_diff('day', start_date, end_date) FROM visits"
Supported Dialects:
- BigQuery, Snowflake, DuckDB
- PostgreSQL, MySQL, SQLite
- Redshift, Spark, Trino, Presto
π Data Export Tools
Export OMOP data in standardized formats:
from omop_mcp.tools.export import (
export_concept_set,
export_sql_query,
export_query_results,
export_cohort_definition
)
# Export concept set to CSV
await export_concept_set(
concepts=concept_list,
format="csv",
output_path="diabetes_concepts.csv"
)
# Export SQL query with metadata
await export_sql_query(
sql=cohort_sql,
metadata={"description": "Diabetes cohort", "author": "researcher"},
output_path="cohort_query.json"
)
# Export query results
await export_query_results(
results=query_results,
format="json",
output_path="cohort_results.json",
include_metadata=True
)
Supported Formats: JSON, CSV with automatic type handling
π Quick Start
Prerequisites
- Python 3.11+ (uses modern type hints)
- OMOP CDM database (BigQuery or Postgres)
- ATHENA API access (public, no key required)
Installation
Quick Install with UV Extras
# Local development (DuckDB only)
uv pip install omop-mcp[duckdb]
# Cloud analytics (BigQuery + Snowflake)
uv pip install omop-mcp[cloud]
# All backends
uv pip install omop-mcp[all-backends]
# Development with all tools
uv pip install omop-mcp[dev,all-backends]
Traditional Installation
# Install from PyPI (when published)
pip install omop-mcp
# Or install from source
git clone https://github.com/aandresalvarez/omop-mcp.git
cd omop-mcp
uv sync
Configuration
Copy .env.example to .env and customize for your environment:
cp .env.example .env
Or set environment variables directly:
# Required: Database backend
BACKEND_TYPE=bigquery # or "snowflake" or "duckdb"
# For BigQuery
BIGQUERY_PROJECT_ID=your-gcp-project
BIGQUERY_DATASET_ID=omop_cdm
BIGQUERY_CREDENTIALS_PATH=/path/to/service-account.json
# Alternative: Use Application Default Credentials (ADC)
# BIGQUERY_CREDENTIALS_PATH= # Leave empty to use ADC
#
# ADC Authentication Methods:
# 1. User credentials (development):
# gcloud auth application-default login
#
# 2. Service account via environment variable:
# export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
#
# 3. Metadata service (GCP environments - Cloud Run, Compute Engine, etc.):
# Automatically available - no additional setup needed
#
# 4. Workload Identity (Kubernetes):
# Configured via service account annotations
# For Snowflake
SNOWFLAKE_ACCOUNT=your-account.snowflakecomputing.com
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_DATABASE=omop_db
SNOWFLAKE_SCHEMA=cdm
SNOWFLAKE_WAREHOUSE=compute_wh
# For DuckDB (local/embedded - no credentials needed!)
DUCKDB_DATABASE_PATH=:memory: # or "./omop.duckdb" for persistent
DUCKDB_SCHEMA=main
# Optional: Security
MAX_COST_USD=1.0 # Cost limit for BigQuery queries
MAX_QUERY_TIMEOUT_SEC=30 # Query timeout
PHI_MODE=false # Set true to allow patient_id queries
# Optional: OAuth (for production)
OAUTH_ISSUER=https://your-auth-provider.com
OAUTH_AUDIENCE=omop-mcp-api
Running the Server
Authentication Methods
The OMOP MCP server supports multiple authentication methods for BigQuery access:
Method 1: Service Account (Recommended for Production)
# Download service account key
gcloud iam service-accounts keys create omop-mcp-key.json \
--iam-account=omop-mcp-server@your-project-id.iam.gserviceaccount.com
# Set environment variable
BIGQUERY_CREDENTIALS_PATH=/path/to/omop-mcp-key.json
Method 2: Application Default Credentials (ADC)
# Option 1: User credentials (development)
gcloud auth application-default login
# Option 2: Service account via environment variable
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# Option 3: Metadata service (GCP environments)
# Automatically available in Cloud Run, Compute Engine, etc.
# Leave credentials path empty to use ADC
BIGQUERY_CREDENTIALS_PATH=
Authentication Priority:
- Service account JSON file (if
BIGQUERY_CREDENTIALS_PATHis set and file exists) - Application Default Credentials (ADC)
GOOGLE_APPLICATION_CREDENTIALSenvironment variable- Metadata service (GCP environments)
- User credentials (
gcloud auth application-default login)
Start the Server
Option 1: As MCP Server (for Claude Desktop, etc.)
Add to your MCP client configuration (e.g., claude_desktop_config.json):
{
"mcpServers": {
"omop": {
"command": "uv",
"args": ["run", "omop-mcp"],
"env": {
"BIGQUERY_PROJECT_ID": "your-project",
"BIGQUERY_DATASET_ID": "omop_cdm",
"BIGQUERY_CREDENTIALS_PATH": "/path/to/credentials.json"
}
}
}
}
Restart Claude Desktop, and you'll see OMOP tools available.
Option 2: Direct Python Usage
import asyncio
from omop_mcp.tools.athena import discover_concepts
from omop_mcp.tools.query import query_by_concepts
async def main():
# Step 1: Discover concepts
result = await discover_concepts(
query="type 2 diabetes",
domain="Condition",
standard_only=True
)
print(f"Found {len(result.concepts)} concepts:")
for concept in result.concepts:
print(f" - {concept.concept_name} ({concept.concept_id})")
# Step 2: Query database
concept_ids = [c.concept_id for c in result.concepts]
query_result = await query_by_concepts(
query_type="count",
concept_ids=concept_ids,
domain="Condition",
backend="bigquery",
execute=True
)
print(f"\nSQL Generated:\n{query_result.sql}")
print(f"\nPatient count: {query_result.results[0]['patient_count']}")
print(f"Estimated cost: ${query_result.estimated_cost_usd:.4f}")
asyncio.run(main())
Option 3: As Standalone Server
# Run MCP server on stdio
uv run python -m omop_mcp.server
# Or with explicit backend
BACKEND_TYPE=postgres uv run python -m omop_mcp.server
π οΈ Available MCP Tools
Core Tools
| Tool | Purpose | Input Parameters | Returns |
|---|---|---|---|
discover_concepts |
Search ATHENA for concepts | query, domain, vocabulary, standard_only, limit |
ConceptDiscoveryResult |
get_concept_relationships |
Explore concept hierarchies | concept_id, relationship_id |
List of ConceptRelationship |
query_omop |
Execute analytical queries | query_type, concept_ids, domain, backend, execute |
QueryOMOPResult |
generate_cohort_sql |
Create temporal cohort queries | exposure_ids, outcome_ids, time_window, dialect |
SQL string |
Direct SQL Tools (New! π)
| Tool | Purpose | Input Parameters | Returns |
|---|---|---|---|
get_information_schema |
Get database schema info | table_name, backend |
Table/column definitions |
select_query |
Execute direct SQL with validation | sql, validate, execute, backend, limit |
Query results + metadata |
Export Tools (New! π)
| Tool | Purpose | Input Parameters | Returns |
|---|---|---|---|
export_concept_set |
Export concepts to JSON/CSV | concepts, format, output_path |
Saved file path |
export_sql_query |
Export SQL with metadata | sql, metadata, output_path |
Saved file path |
export_query_results |
Export results to JSON/CSV | results, format, output_path, include_metadata |
Saved file path |
export_cohort_definition |
Export complete cohort definition | definition, output_path |
Saved file path |
SQL Tools (New! π)
| Tool | Purpose | Input Parameters | Returns |
|---|---|---|---|
translate_sql |
Cross-dialect SQL translation | sql, source_dialect, target_dialect |
Translated SQL |
validate_sql |
Validate SQL syntax | sql, dialect |
Validation result |
format_sql |
Pretty-print SQL | sql, dialect, pretty |
Formatted SQL |
AI Agent Tools (New! π€)
| Tool | Purpose | Input Parameters | Returns |
|---|---|---|---|
concept_discovery_agent |
AI-powered concept search | question, domains |
Structured concept list |
sql_generation_agent |
AI-powered SQL generation | description, exposure, outcome |
Complete cohort SQL |
Resources (Cacheable Data)
| Resource | URI Pattern | Description |
|---|---|---|
| Concept by ID | omop://concept/{id} |
Fetch single concept details |
| Search concepts | athena://search?query={q}&domain={d} |
Paginated concept search |
| Backend capabilities | backend://capabilities |
List available database backends |
Prompts (AI Guidance)
| Prompt | Purpose | Arguments | Output |
|---|---|---|---|
cohort/sql |
Guide SQL generation | exposure, outcome, time_window, dialect |
SQL generation template |
analysis/discovery |
Guide concept discovery | question, domains |
Systematic discovery workflow |
query/multi-step |
Guide query execution | concept_ids, domain |
Cost-aware execution guide |
π Documentation
Core Documentation
- SQL Validation & Security - Comprehensive security features, table allowlists, PHI protection
- Execution & Caching Architecture - Performance optimization, caching strategies, monitoring
- API Reference - Complete tool documentation with examples
Integration Guides
- Claude Desktop Integration - Complete setup guide for Claude Desktop
- LibreChat + Ollama Integration - Local deployment with LibreChat and Ollama
- Generic MCP Client Guide - Integration with any MCP-compatible client
Configuration
- Environment Configuration - Complete configuration reference with examples
- Comprehensive Configuration Guide - Detailed setup for all environments, databases, and AI clients
- UV Extras Installation - Backend-specific installation options
π Security Features
The OMOP MCP server implements comprehensive security measures to protect healthcare data:
SQL Safety Layer
- Only SELECT statements allowed - Blocks all mutating operations (DELETE, UPDATE, DROP, etc.)
- OMOP table allowlist - Restricts access to approved OMOP CDM tables only
- PHI column blocking - Prevents access to sensitive source value columns
- Automatic row limiting - Prevents excessive data retrieval
- Cost validation - BigQuery dry-run validation with cost limits
Configuration Options
# Enable strict table validation
STRICT_TABLE_VALIDATION=true
# Block PHI columns
OMOP_BLOCKED_COLUMNS=person_source_value,provider_source_value
# Set cost limits
MAX_COST_USD=1.0
# Disable patient ID queries in production
ALLOW_PATIENT_LIST=false
Error Types
SecurityViolationError- Dangerous SQL operations detectedTableNotAllowedError- Non-allowlisted table accessedColumnBlockedError- Blocked PHI column accessedCostLimitExceededError- Query cost exceeds limit
See SQL Validation Documentation for complete security details.
π Detailed Examples
Example 1: Basic Concept Discovery
from omop_mcp.tools.athena import discover_concepts
# Search for flu concepts
result = await discover_concepts(
query="influenza",
domain="Condition",
standard_only=True,
limit=10
)
print(f"Found {len(result.concepts)} concepts")
for concept in result.concepts:
print(f"{concept.concept_id}: {concept.concept_name}")
print(f" Domain: {concept.domain_id}, Vocabulary: {concept.vocabulary_id}")
print(f" Standard: {concept.is_standard()}, Valid: {concept.is_valid()}")
Output:
Found 3 concepts
4171852: Influenza
Domain: Condition, Vocabulary: SNOMED
Standard: True, Valid: True
4171853: Influenza due to seasonal influenza virus
Domain: Condition, Vocabulary: SNOMED
Standard: True, Valid: True
Example 2: Patient Count Query
from omop_mcp.tools.query import query_by_concepts
# Count patients with diabetes (concept IDs from discovery)
result = await query_by_concepts(
query_type="count",
concept_ids=[201826, 201254], # Type 2 diabetes concepts
domain="Condition",
backend="bigquery",
execute=False # Dry-run first
)
print(f"SQL: {result.sql}")
print(f"Estimated cost: ${result.estimated_cost_usd:.4f}")
print(f"Estimated bytes: {result.estimated_bytes:,}")
# If cost acceptable, execute
if result.estimated_cost_usd < 0.10:
result = await query_by_concepts(
query_type="count",
concept_ids=[201826, 201254],
domain="Condition",
backend="bigquery",
execute=True # Actually run it
)
print(f"Patient count: {result.results[0]['patient_count']}")
Example 3: Demographic Breakdown
# Get age/gender breakdown for diabetes patients
result = await query_by_concepts(
query_type="breakdown",
concept_ids=[201826],
domain="Condition",
backend="bigquery",
execute=True
)
print("Demographics:")
for row in result.results:
gender = "Male" if row['gender_concept_id'] == 8507 else "Female"
print(f" {gender}, Age {row['age_years']}: {row['patient_count']} patients")
Output:
Demographics:
Male, Age 65: 145 patients
Female, Age 58: 132 patients
Male, Age 72: 98 patients
...
Example 4: Multi-Step Workflow (Discovery β Query)
async def analyze_condition(condition_name: str):
"""Complete workflow: discover concepts and query database."""
# Step 1: Discover concepts
print(f"Discovering concepts for '{condition_name}'...")
discovery = await discover_concepts(
query=condition_name,
domain="Condition",
standard_only=True
)
if not discovery.concepts:
print("No concepts found!")
return
print(f"Found {len(discovery.concepts)} concepts:")
for c in discovery.concepts:
print(f" - {c.concept_name} ({c.concept_id})")
# Step 2: Estimate query cost
concept_ids = [c.concept_id for c in discovery.concepts]
print("\nEstimating query cost...")
estimate = await query_by_concepts(
query_type="count",
concept_ids=concept_ids,
domain="Condition",
backend="bigquery",
execute=False
)
print(f"Estimated cost: ${estimate.estimated_cost_usd:.4f}")
# Step 3: Execute if cost acceptable
if estimate.estimated_cost_usd < 1.0:
print("\nExecuting query...")
result = await query_by_concepts(
query_type="count",
concept_ids=concept_ids,
domain="Condition",
backend="bigquery",
execute=True
)
patient_count = result.results[0]['patient_count']
print(f"β
Found {patient_count:,} patients with {condition_name}")
else:
print("β Query too expensive, skipping execution")
# Run the workflow
await analyze_condition("type 2 diabetes")
Example 5: Cross-Domain Query (Drugs + Conditions)
# Find patients on Metformin who developed acute kidney injury
from omop_mcp.tools.athena import discover_concepts
# Discover drug concept
drug_result = await discover_concepts(query="metformin", domain="Drug")
drug_ids = [c.concept_id for c in drug_result.concepts]
# Discover condition concept
condition_result = await discover_concepts(query="acute kidney injury", domain="Condition")
condition_ids = [c.concept_id for c in condition_result.concepts]
# Query drug exposures
drug_query = await query_by_concepts(
query_type="count",
concept_ids=drug_ids,
domain="Drug",
backend="bigquery",
execute=True
)
# Query condition occurrences
condition_query = await query_by_concepts(
query_type="count",
concept_ids=condition_ids,
domain="Condition",
backend="bigquery",
execute=True
)
print(f"Patients on Metformin: {drug_query.results[0]['patient_count']}")
print(f"Patients with AKI: {condition_query.results[0]['patient_count']}")
Example 6: Using MCP Resources (Caching)
from omop_mcp.resources import get_concept_resource, search_concepts_resource
# Get single concept (cacheable by MCP client)
concept_resource = await get_concept_resource(concept_id=201826)
print(concept_resource) # Returns concept with URI omop://concept/201826
# Search with pagination (cacheable)
page1 = await search_concepts_resource(
query="diabetes",
domain="Condition",
cursor=None, # First page
page_size=50
)
print(f"Found {len(page1['concepts'])} concepts")
print(f"Next cursor: {page1['next_cursor']}")
# Get next page using cursor
page2 = await search_concepts_resource(
query="diabetes",
domain="Condition",
cursor=page1['next_cursor'],
page_size=50
)
Example 7: Using MCP Prompts (AI Guidance)
from omop_mcp.prompts import get_prompt
# Get SQL generation guidance
prompt = await get_prompt(
prompt_id="cohort/sql",
arguments={
"exposure": "Metformin",
"outcome": "Acute Kidney Injury",
"time_window": "90 days",
"dialect": "bigquery"
}
)
print(prompt["messages"][0]["content"]["text"])
# Returns detailed prompt with SQL template, best practices, and examples
Example 8: Local Development with DuckDB (New! π)
from omop_mcp.backends import DuckDBBackend, translate_query
# Step 1: Develop and test locally with DuckDB (FREE!)
duckdb_backend = DuckDBBackend() # Zero setup required!
# Build cohort SQL
parts = await duckdb_backend.build_cohort_sql(
exposure_ids=[1503297], # Metformin
outcome_ids=[46271022], # Acute kidney injury
pre_outcome_days=90
)
# Test locally (instant, free)
local_results = await duckdb_backend.execute_query(
parts.to_sql(),
limit=10
)
print(f"β
Found {len(local_results)} matching records locally")
# Step 2: Translate to production database
bigquery_sql = translate_query(parts.to_sql(), "duckdb", "bigquery")
snowflake_sql = translate_query(parts.to_sql(), "duckdb", "snowflake")
# Step 3: Run on production (after local validation)
from omop_mcp.backends import BigQueryBackend
bigquery_backend = BigQueryBackend()
validation = await bigquery_backend.validate_sql(bigquery_sql)
print(f"π° Estimated cost: ${validation.estimated_cost_usd:.2f}")
if validation.estimated_cost_usd < 1.0:
prod_results = await bigquery_backend.execute_query(bigquery_sql)
print(f"π Production results: {len(prod_results)} records")
Why this workflow?
- π Free local testing - No cloud costs during development
- β‘ Instant iteration - Test changes in milliseconds
- β Validate before deploy - Catch errors locally
- π° Cost-conscious - Only pay for production queries
- π Cross-platform - Same SQL works on BigQuery, Snowflake, DuckDB
Example 9: AI-Powered Concept Discovery (New! π€)
from omop_mcp.agents import ConceptDiscoveryAgent
# Initialize AI agent
agent = ConceptDiscoveryAgent()
# Natural language concept search
result = await agent.run(
"Find all concepts related to type 2 diabetes and its complications"
)
print(f"Found {len(result.concepts)} concepts:")
for concept in result.concepts[:5]:
print(f" - {concept.concept_name} ({concept.concept_id})")
print(f" Domain: {concept.domain_id}, Confidence: {concept.confidence}")
# Agent automatically:
# - Understands medical context
# - Searches multiple domains
# - Filters for relevance
# - Returns structured results
Example 10: AI-Powered SQL Generation (New! π€)
from omop_mcp.agents import SQLGenerationAgent
# Initialize SQL agent
agent = SQLGenerationAgent()
# Generate cohort SQL from natural language
result = await agent.run(
description="Patients on metformin who developed acute kidney injury within 90 days",
exposure="metformin",
outcome="acute kidney injury"
)
print("Generated SQL:")
print(result.sql)
print(f"\nExposure concepts: {result.exposure_ids}")
print(f"Outcome concepts: {result.outcome_ids}")
print(f"Validation: {'β
Valid' if result.validation.valid else 'β Invalid'}")
# Agent automatically:
# - Discovers relevant concepts
# - Generates cohort SQL
# - Validates syntax
# - Returns complete, executable query
π Example Workflow
Research Question: "How many patients developed acute kidney injury after starting Metformin?"
1. User asks the question
β
2. MCP discovers concepts:
- Metformin β 1503297 (Drug)
- Acute kidney injury β 46271022 (Condition)
β
3. MCP generates cohort SQL with 90-day temporal window
β
4. MCP validates query (estimated cost: $0.08)
β
5. User approves execution
β
6. MCP returns results:
- 1,234 patients
- Median time to event: 45 days
- SQL available for reproduction
π Who Should Use This?
- Clinical Researchers: Build cohorts faster with natural language
- Data Scientists: Generate validated SQL without memorizing OMOP schema
- Healthcare Analysts: Explore OMOP datasets interactively
- Informaticists: Map clinical terminologies automatically
- Study Coordinators: Assess feasibility with quick patient counts
- AI Developers: Integrate OMOP capabilities into health AI applications
π§ͺ Testing
Run the comprehensive test suite:
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=omop_mcp --cov-report=html
# Run specific test categories
uv run pytest tests/test_integration.py # E2E workflows
uv run pytest tests/test_athena.py # ATHENA API
uv run pytest tests/test_query_security.py # Security guards
# Run with verbose output
uv run pytest -v
Test Coverage: 173 tests, 100% passing β
- Unit tests: 161 (models, backends, tools, resources, prompts, agents, export, sqlgen, dialect)
- Integration tests: 12 (E2E discoverβquery workflows, multi-backend, cross-dialect translation)
- Coverage areas:
- β Core OMOP tools (athena, query, sqlgen)
- β AI agents (concept discovery, SQL generation)
- β Export tools (JSON, CSV, all data types)
- β Multi-backend (BigQuery, Snowflake, DuckDB)
- β SQL translation (10+ dialects)
- β Security & validation
π¨βπ» Development
Quick Start
# Clone and setup
git clone https://github.com/aandresalvarez/omop-mcp.git
cd omop-mcp
# Install with all dev dependencies
uv sync --extra dev
# Run quality checks
make check # Format, lint, typecheck, test
make check-all # All checks + pylint + pyright + security
Quality Tools
This project uses comprehensive quality infrastructure for healthcare-grade code:
Code Quality
make format # Black + ruff auto-formatting
make lint # Ruff linting
make pylint # Strict linting
make typecheck # mypy type checking
make pyright # pyright type checking
Security Scanning π
make security # bandit (Python security) + pip-audit (vulnerabilities)
make audit # Comprehensive: security + safety dependency scan
SQL Quality (Critical for OMOP queries)
make sql-lint # Lint SQL files with sqlfluff
make sql-fix # Auto-fix SQL formatting
Testing & Coverage
make test # Run pytest
make coverage # Detailed coverage report (HTML + JSON + terminal)
Available Make Targets
make help # Show all available commands
# Setup
make dev # Full development setup (venv + deps + tools)
# Quality checks
make check # Standard checks (format, lint, typecheck, test)
make check-all # ALL checks including security
make pre-commit # Run pre-commit hooks
# Security
make security # Security scans (bandit + pip-audit)
make audit # Full security audit (+ safety)
# SQL
make sql-lint # Lint SQL with sqlfluff
make sql-fix # Auto-format SQL
# Testing
make test # Run tests
make coverage # Generate coverage reports
# Server
make http # Run MCP server (HTTP mode)
make stdio # Run MCP server (stdio mode)
CI/CD
GitHub Actions automatically runs on every push:
- β Tests (Python 3.11, 3.12)
- β Code quality (format, lint)
- β Type checking (mypy, pyright)
- β Security scanning (bandit, pip-audit, safety)
- β SQL quality (sqlfluff)
- β Coverage reporting
See .github/workflows/quality.yml for details.
Pre-commit Hooks
Install git hooks for automatic quality checks:
make pre-commit-install
This runs formatting, linting, and type checking before each commit.
π§ Troubleshooting
Common Issues
Problem: "Backend not found: bigquery"
# Solution: Install backend dependencies
uv pip install google-cloud-bigquery
# Or use DuckDB for local development (no setup required!)
export BACKEND_TYPE=duckdb
Problem: "ATHENA API timeout"
# Solution: The public ATHENA API can be slow. Increase timeout:
export ATHENA_TIMEOUT_SEC=60
# Or use cached results from MCP resources
Problem: "Query exceeds cost limit"
# Solution: Increase cost cap or optimize query
export MAX_COST_USD=5.0
# Or run with execute=False to see SQL first
Problem: "concept_ids cannot be empty"
# Solution: Discovery returned no results. Try broader search:
result = await discover_concepts(
query="diabetes", # Broader term
standard_only=False # Include non-standard
)
Problem: "OAuth token invalid"
# Solution: Check token format and issuer
# Token must be Bearer JWT with correct audience
export OAUTH_AUDIENCE=omop-mcp-api
Debugging Tips
Enable debug logging:
import structlog
structlog.configure(
wrapper_class=structlog.make_filtering_bound_logger(logging.DEBUG)
)
Check backend connectivity:
from omop_mcp.backends.registry import list_backends, get_backend, get_supported_dialects
# List all registered backends
backends = list_backends()
print(f"Available backends: {backends}")
# Check supported SQL dialects
dialects = get_supported_dialects()
print(f"Supported dialects: {dialects}")
# Get specific backend
backend = get_backend("duckdb") # or "bigquery" or "snowflake"
print(f"Connected to: {backend.name} (dialect: {backend.dialect})")
Test SQL translation:
from omop_mcp.backends import translate_sql, validate_sql
# Translate SQL between dialects
bigquery_sql = "SELECT DATE_DIFF(end_date, start_date, DAY) FROM visits"
snowflake_sql = translate_sql(bigquery_sql, "bigquery", "snowflake")
print(f"Translated SQL: {snowflake_sql}")
# Validate SQL for specific dialect
is_valid, error = validate_sql(snowflake_sql, "snowflake")
print(f"Valid: {is_valid}, Error: {error}")
Validate SQL without execution:
result = await query_by_concepts(
query_type="count",
concept_ids=[201826],
domain="Condition",
backend="bigquery",
execute=False # SQL only, no execution
)
print(result.sql)
print(f"Cost: ${result.estimated_cost_usd}")
π Performance Tips
General Tips
- Use MCP Resources for caching: Resources are cached by MCP clients
- Batch concept lookups: Search once, query multiple times
- Start with execute=False: Validate SQL and cost before running
- Use standard_only=True: Reduces search result size
- Set appropriate limits: Default is 50 concepts, increase if needed
Backend-Specific Tips
DuckDB (Local Development):
- β
Instant startup: In-memory mode (
:memory:) is fastest - β Free testing: No cloud costs, iterate rapidly
- β
File-based persistence: Use
./omop.duckdbfor persistent storage - β Import Parquet: DuckDB can query Parquet files directly
- β‘ Performance: ~1-10GB datasets run in seconds
BigQuery (Cloud-Scale):
- π° Enable query result caching: Results cached for 24 hours (free!)
- π Partition tables: Use partitioned OMOP tables for cost reduction
- π Use dry-run first: Check cost before execution
- π΅ Monitor costs: Set
MAX_COST_USDto prevent expensive queries - β‘ Performance: Scales to petabytes
Snowflake (Enterprise):
- βοΈ Use warehouse sizes appropriately: Start with X-Small for dev
- π Enable result cache: Snowflake caches identical queries
- π Scale compute: Adjust warehouse size based on query complexity
- π° Suspend warehouses: Auto-suspend after 5 minutes of inactivity
- β‘ Performance: Excellent for complex analytics
Local β Production Workflow
# 1. Develop locally with DuckDB (free, fast)
duckdb_backend = DuckDBBackend()
local_results = await duckdb_backend.execute_query(sql, limit=10)
# 2. Translate to production dialect
prod_sql = translate_sql(sql, "duckdb", "bigquery")
# 3. Validate cost before production run
bigquery_backend = BigQueryBackend()
validation = await bigquery_backend.validate_sql(prod_sql)
print(f"Cost: ${validation.estimated_cost_usd:.2f}")
# 4. Execute on production if cost acceptable
if validation.estimated_cost_usd < 1.0:
prod_results = await bigquery_backend.execute_query(prod_sql)
This workflow saves money and time by validating locally first!
π Security Best Practices
- Use OAuth in production: Enable
OAUTH_ISSUERandOAUTH_AUDIENCE - Set cost limits: Default is $1, adjust based on your budget
- Disable PHI mode: Set
PHI_MODE=falseto block patient ID queries - Use service accounts: For BigQuery, use dedicated service account with read-only access
- Prefer ADC in cloud environments: Use Application Default Credentials for Cloud Run, Compute Engine, etc.
- Enable audit logging: All queries are logged with structlog
- Set query timeouts: Default 30s, adjust as needed
π Resources
Official Documentation
Project Documentation
Feature Documentation
- SQL Generation Module
- PydanticAI Agents
- Export Tools
- Additional Backends
- Optional Enhancements Summary
Agent Documentation
π License
MIT License - see LICENSE for details.
π€ Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
π‘ Need Help?
- π Documentation
- π¬ Discussions
- π Issues
Built with β€οΈ for the OHDSI community
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.