CDISC SDTM Validator MCP
Validates CDISC SDTM datasets against SDTMIG 3.4 specifications, offering tools for required variables, controlled terminology, and full dataset validation with bundled sample datasets.
A Model Context Protocol (MCP) server for validating CDISC SDTM datasets against SDTMIG 3.4 specifications.
Overview
This MCP server provides AI agents and clinical programmers with tools to validate Study Data Tabulation Model (SDTM) datasets. It demonstrates a complete end-to-end pipeline: raw pharmaceutical data → SDTM transformation → validation.
Tools
1. check_required_variables(columns, domain="DM")
Validates that a dataset contains the three universal SDTM identifier variables required in every domain:
- STUDYID — Study identifier
- DOMAIN — Domain abbreviation (e.g., "DM")
- USUBJID — Unique subject identifier
Returns: {"domain", "required", "missing", "ok"}
2. check_dm_required_variables(columns)
Validates that the Demographics (DM) domain contains all required variables per SDTMIG 3.4 Table 3-1:
- Universal (3): STUDYID, DOMAIN, USUBJID
- DM-specific (12): SUBJID, AGE, AGEU, SEX, RACE, ETHNIC, COUNTRY, ARMCD, ARM, ACTARMCD, ACTARM, RFSTDTC
Returns: {"required", "missing", "ok"}
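The two required-variable checks above can be sketched as follows. This is a minimal illustration built only from the variable lists and return shapes given in this README, not the code in cdisc-mcp.py; the helper `_missing` is hypothetical, and exact, case-sensitive matching of column names is an assumption.

```python
# Variable lists copied from the tool descriptions above.
UNIVERSAL = ["STUDYID", "DOMAIN", "USUBJID"]
DM_SPECIFIC = [
    "SUBJID", "AGE", "AGEU", "SEX", "RACE", "ETHNIC", "COUNTRY",
    "ARMCD", "ARM", "ACTARMCD", "ACTARM", "RFSTDTC",
]


def _missing(columns, required):
    """Hypothetical helper: required names absent from columns, in order."""
    present = set(columns)  # assumption: exact, case-sensitive match
    return [name for name in required if name not in present]


def check_required_variables(columns, domain="DM"):
    """Check the three universal identifier variables."""
    missing = _missing(columns, UNIVERSAL)
    return {
        "domain": domain,
        "required": list(UNIVERSAL),
        "missing": missing,
        "ok": not missing,
    }


def check_dm_required_variables(columns):
    """Check the universal plus DM-specific variables (15 in total)."""
    required = UNIVERSAL + DM_SPECIFIC
    missing = _missing(columns, required)
    return {"required": required, "missing": missing, "ok": not missing}
```

For example, `check_required_variables(["DOMAIN", "USUBJID"])` would report `STUDYID` under `missing` and set `ok` to `False`.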
3. check_controlled_terminology(column, values)
Validates that values in a column conform to CDISC Controlled Terminology (CT) codelists.
Supported variables:
- SEX (C66731): F, M, U, UNDIFFERENTIATED
- ETHNIC (C66790): HISPANIC OR LATINO, NOT HISPANIC OR LATINO, NOT REPORTED, UNKNOWN
- RACE (C74457): WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, AMERICAN INDIAN OR ALASKA NATIVE, NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER, MULTIPLE, NOT REPORTED, UNKNOWN
- AGEU (C66781): YEARS, MONTHS, WEEKS, DAYS, HOURS
- DTHFL: Y (only valid value for death flag; null/absent means no death)
Returns: {"column", "codelist_id", "valid_values", "invalid", "ok"}
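A hardcoded-codelist check of this kind might look like the sketch below. The codelist IDs and values are copied from the list above; everything else is an assumption for illustration: that null and empty values are skipped for every column (the README only states this for DTHFL), that `codelist_id` is `None` where no ID is listed, and that an unsupported column raises `ValueError`.

```python
# Codelists as listed in this README: column -> (codelist ID, valid values).
CODELISTS = {
    "SEX": ("C66731", ["F", "M", "U", "UNDIFFERENTIATED"]),
    "ETHNIC": ("C66790", [
        "HISPANIC OR LATINO", "NOT HISPANIC OR LATINO",
        "NOT REPORTED", "UNKNOWN",
    ]),
    "RACE": ("C74457", [
        "WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN",
        "AMERICAN INDIAN OR ALASKA NATIVE",
        "NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER",
        "MULTIPLE", "NOT REPORTED", "UNKNOWN",
    ]),
    "AGEU": ("C66781", ["YEARS", "MONTHS", "WEEKS", "DAYS", "HOURS"]),
    "DTHFL": (None, ["Y"]),  # no codelist ID is given in this README
}


def check_controlled_terminology(column, values):
    """Report values that are not in the column's codelist."""
    if column not in CODELISTS:
        # Assumption: the real tool may handle unsupported columns differently.
        raise ValueError(f"No codelist configured for column {column!r}")
    codelist_id, valid_values = CODELISTS[column]
    allowed = set(valid_values)
    # Assumption: null and empty strings are treated as absent, not invalid.
    invalid = sorted(
        {v for v in values if v not in (None, "") and v not in allowed}
    )
    return {
        "column": column,
        "codelist_id": codelist_id,
        "valid_values": valid_values,
        "invalid": invalid,
        "ok": not invalid,
    }
```

Returning the distinct invalid values rather than every occurrence keeps the report short for large datasets; whether the server deduplicates is not stated here.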
4. validate_dataset(dataset_name=None, dataset=None)
Runs the full validation suite in a single call and returns a combined report. Provide either a bundled sample name (see list_sample_datasets) or an inline Dataset JSON object. This is the recommended entry point for agents — it loads the data, runs all three checks (controlled terminology only for the codelist columns present), and summarizes pass/fail.
Returns: {"dataset", "checks": [...], "summary": {"ok", "passed", "failed"}}
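How per-check results roll up into the summary might look like the sketch below. It assumes that `passed` and `failed` are counts and that every entry in `checks` carries an `ok` flag; the helper names `summarize` and `build_report` are hypothetical and not taken from cdisc-mcp.py.

```python
def summarize(checks):
    """Roll a list of check results up into an overall pass/fail summary."""
    passed = sum(1 for check in checks if check["ok"])
    failed = len(checks) - passed
    # Assumption: the report is ok only when no individual check failed.
    return {"ok": failed == 0, "passed": passed, "failed": failed}


def build_report(dataset_name, checks):
    """Assemble a report in the shape described above."""
    return {
        "dataset": dataset_name,
        "checks": checks,
        "summary": summarize(checks),
    }
```

An agent consuming the report can branch on `report["summary"]["ok"]` first and only walk `report["checks"]` when it is false.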
5. list_sample_datasets()
Lists the sample datasets bundled with the server (read live from samples/). Each entry is {"name", "label", "study", "records", "columns", "description"}; use the name with validate_dataset or the sample://<name> resource.
Resources
Each bundled sample is also exposed as an MCP resource at sample://<name> (e.g. sample://pharmaverse_dm), so MCP clients can discover and load the raw Dataset JSON through the native resource primitive.
End-to-End Demo: pharmaverseraw → sdtm.oak → MCP Validation
Background
The pharmaverse ecosystem provides industry-standard tools and data for learning and teaching SDTM. The pharmaverseraw R package contains the CDISCPILOT01 study — a realistic clinical trial dataset in pre-SDTM "raw EDC" format. The sdtm.oak package transforms this raw data into a valid SDTM dataset.
This MCP server completes the pipeline by validating the output:
pharmaverseraw (raw EDC data)
↓ sdtm.oak transformation
SDTM DM domain
↓ MCP validation
Validation report
Sample Data: CDISCPILOT01
samples/pharmaverse_dm.json contains 5 subjects from the CDISCPILOT01 study in Dataset JSON format (CDISC's standard JSON data representation). Clinical programmers commonly use this realistic pilot-study data to learn SDTM transformation workflows.
Subject demographics:
- Age range: 63–77 years
- Treatment arms: PLACEBO, Xanomeline High Dose, Xanomeline Low Dose
- Countries: USA, Japan, Germany
- One subject with a recorded death (DTHFL = "Y")
Running the Demo
The server owns the whole pipeline. With the server running (see below), validate a bundled sample with one MCP call:
# List the available samples
curl -s localhost:8000/samples
# Fetch one sample's raw Dataset JSON
curl -s localhost:8000/samples/pharmaverse_dm
# Run the full validation suite via the validate_dataset tool
curl -s localhost:8000/mcp \
-H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"validate_dataset","arguments":{"dataset_name":"pharmaverse_dm"}}}'
The validate_dataset report includes a summary ({"ok", "passed", "failed"}) plus a per-check breakdown. Try dataset_name: "dm_missing_studyid" to see check_required_variables fail on the missing STUDYID identifier.
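The same tools/call request can be sent from Python using only the standard library. The sketch below mirrors the curl command above (same URL, headers, and JSON-RPC body); the function names are hypothetical, and the response is returned as raw text because the exact response framing depends on how the server is configured.

```python
import json
import urllib.request


def build_tools_call(dataset_name, request_id=1):
    """Build the JSON-RPC body used by the curl example above."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "validate_dataset",
            "arguments": {"dataset_name": dataset_name},
        },
    }


def call_validate_dataset(dataset_name, base_url="http://localhost:8000"):
    """POST the request to the /mcp endpoint and return the raw response text."""
    body = json.dumps(build_tools_call(dataset_name)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/mcp",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

With the server running locally, `print(call_validate_dataset("dm_missing_studyid"))` should show the failing report described above.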
The interactive landing page at / does the same thing visually: it lists samples from /samples and renders the validate_dataset report.
Running the Server
Local Development
# Install dependencies
pip install -r requirements.txt
# Start the development server
uvicorn cdisc-mcp:app --reload --port 8000
# Landing page with interactive tool testing
open http://localhost:8000/
# MCP endpoint (for AI agents / clients)
http://localhost:8000/mcp
Configuration
Configuration is optional. One environment variable is recognized (set it in your shell or a local .env file):
- CONNECT_SERVER: Restricts incoming connections to a specific Posit Connect hostname (via TrustedHostMiddleware). Leave unset for local development.
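One way to turn this variable into a host allow-list is sketched below. It is an assumption about the wiring, not the code in cdisc-mcp.py: the function name `allowed_hosts` is hypothetical, and falling back to `["*"]` (which Starlette's TrustedHostMiddleware treats as "allow any host") when the variable is unset is a guess at what "leave unset for local development" implies.

```python
import os


def allowed_hosts(environ=None):
    """Derive an allow-list from CONNECT_SERVER (hypothetical helper)."""
    env = os.environ if environ is None else environ
    host = env.get("CONNECT_SERVER", "").strip()
    if not host:
        # Unset or blank: do not restrict hosts (local development).
        return ["*"]
    return [host]


# The result would then be passed to Starlette's middleware, for example:
#   app.add_middleware(TrustedHostMiddleware, allowed_hosts=allowed_hosts())
```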
Sample Datasets
The server reads sample datasets from samples/ on disk and serves them via /samples, the sample:// resources, and validate_dataset. Edit a file in samples/ and the change is reflected immediately — there is no copy baked into the front-end.
- pharmaverse_dm.json: Realistic CDISCPILOT01 data (all 24 DM columns)
- dm.json: Hand-crafted valid DM domain (5 subjects, common variables)
- dm_missing_studyid.json: Test case with STUDYID missing (validates error detection)
Add another sample by dropping a Dataset JSON file into samples/; it appears automatically in the listing, the dropdown, and as a sample://<name> resource. (Optionally add a one-line description to SAMPLE_DESCRIPTIONS in cdisc-mcp.py.)
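Reading the listing from disk on every call is what makes new files appear automatically. A minimal sketch of that idea follows; the function name `list_samples` is hypothetical, and it assumes that `study` comes from the file's `studyOID` field and that `records` and `columns` are counts, none of which is confirmed by this README. The `description` field is omitted because it comes from SAMPLE_DESCRIPTIONS in the server.

```python
import json
from pathlib import Path


def list_samples(samples_dir):
    """Scan samples_dir for Dataset JSON files and describe each one."""
    entries = []
    # Re-globbing on every call means a newly dropped file shows up at once.
    for path in sorted(Path(samples_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        entries.append({
            "name": path.stem,  # file name without the .json extension
            "label": data.get("label"),
            "study": data.get("studyOID"),  # assumption: study maps to studyOID
            "records": len(data.get("rows", [])),
            "columns": len(data.get("columns", [])),
        })
    return entries
```

Sorting the paths keeps the listing order stable between requests, which matters for a dropdown that is rebuilt from this output.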
Dataset JSON Format
Datasets are represented in CDISC Dataset JSON 1.1.0 format. Example structure:
{
"studyOID": "CDISCPILOT01",
"name": "DM",
"label": "Demographics",
"columns": [
{"name": "STUDYID", "label": "Study Identifier", ...},
{"name": "DOMAIN", "label": "Domain Abbreviation", ...},
...
],
"rows": [
["CDISCPILOT01", "DM", "01-701-1015", ...],
...
]
}
See the CDISC Dataset JSON specification for full details.
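Because rows are positional arrays, a consumer has to pair each row with the column names before it can look up a variable. The sketch below shows one way to do that for the structure above; the helper names are hypothetical, and the example dataset is trimmed to three columns and one row using only the values shown in this README.

```python
def column_names(dataset):
    """Return the column names in positional order."""
    return [column["name"] for column in dataset["columns"]]


def to_records(dataset):
    """Turn positional rows into one dict per subject."""
    names = column_names(dataset)
    return [dict(zip(names, row)) for row in dataset["rows"]]


def column_values(dataset, name):
    """Collect every value of one column, e.g. to feed a terminology check."""
    index = column_names(dataset).index(name)
    return [row[index] for row in dataset["rows"]]


# Trimmed illustration: three columns and a single row.
example = {
    "studyOID": "CDISCPILOT01",
    "name": "DM",
    "label": "Demographics",
    "columns": [
        {"name": "STUDYID", "label": "Study Identifier"},
        {"name": "DOMAIN", "label": "Domain Abbreviation"},
        {"name": "USUBJID", "label": "Unique Subject Identifier"},
    ],
    "rows": [["CDISCPILOT01", "DM", "01-701-1015"]],
}
```

Here `column_names(example)` yields the list that the required-variable checks take as `columns`, and `column_values(example, "DOMAIN")` yields the values for a single variable.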
Architecture
All work happens in the server. cdisc-mcp.py reads the sample data from disk, owns the validation orchestration (validate_dataset), and serves both the MCP endpoint and the landing page. The front-end is a thin viewer with no data of its own.
Runtime files (everything the deployed app needs):
- cdisc-mcp.py: FastMCP server with the validation tools, validate_dataset orchestration, /samples routes, and sample:// resources; the deployment entrypoint
- landing.html: Interactive landing page; fetches samples and results from the server at runtime
- samples/: Sample datasets in Dataset JSON format; read live by the server (a single source of truth)
- requirements.txt: Python dependencies, used by Connect to build the environment
Key Design:
- Single source of truth for data: samples/*.json on disk, read on every request, so editing a sample is reflected everywhere immediately
- Stateless HTTP service (scales on Posit Connect)
- Hardcoded CT codelists (no external API calls; checks run fully offline)
- Deployment-agnostic front-end (MCP and /samples URLs derived client-side from the page location)
- Validation tools registered as plain callables, so validate_dataset and other Python code can call them directly (not over HTTP)
Deployment to Posit Connect
The server can be deployed to Posit Connect with Posit Publisher, which generates the .posit/ deployment metadata for your environment. Once deployed:
- Set CONNECT_SERVER to the Connect hostname so TrustedHostMiddleware scopes incoming connections.
- The server is stateless, so it scales horizontally without shared state.
The deployed bundle needs all runtime files — cdisc-mcp.py, landing.html, requirements.txt, and the samples/ directory (the server reads it at runtime). Make sure the files list in the Posit Publisher config includes samples/.
Future Expansion
Possible directions for richer validation:
- Full CDISC Conformance Rules (CORE) validation
- Completeness checks beyond required variables
- Domain-specific business rule validation
- Version-specific SDTMIG checking
References
- CDISC SDTM Implementation Guide (SDTMIG) 3.4
- pharmaverse — CDISC-compliant R tools for clinical data
- CDISC Dataset JSON
- Model Context Protocol (MCP)