CDISC SDTM Validator MCP

Validates CDISC SDTM datasets against SDTMIG 3.4 specifications, offering tools for required variables, controlled terminology, and full dataset validation with bundled sample datasets.

A Model Context Protocol (MCP) server for validating CDISC SDTM datasets against SDTMIG 3.4 specifications.

Overview

This MCP server provides AI agents and clinical programmers with tools to validate Study Data Tabulation Model (SDTM) datasets. It demonstrates a complete end-to-end pipeline: raw pharmaceutical data → SDTM transformation → validation.

Tools

1. check_required_variables(columns, domain="DM")

Validates that a dataset contains the three universal SDTM identifier variables required in every domain:

  • STUDYID — Study identifier
  • DOMAIN — Domain abbreviation (e.g., "DM")
  • USUBJID — Unique subject identifier

Returns: {"domain", "required", "missing", "ok"}
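A minimal sketch of how this check could behave, assuming column names are compared case-sensitively and the report uses exactly the keys listed above. The body is illustrative and not taken from the server's source.

```python
# The three identifier variables named in the list above.
UNIVERSAL_IDENTIFIERS = ["STUDYID", "DOMAIN", "USUBJID"]


def check_required_variables(columns, domain="DM"):
    """Report which universal identifier variables are absent from `columns`."""
    present = set(columns)
    # Keep the canonical order so the report is stable across calls.
    missing = [name for name in UNIVERSAL_IDENTIFIERS if name not in present]
    return {
        "domain": domain,
        "required": list(UNIVERSAL_IDENTIFIERS),
        "missing": missing,
        "ok": not missing,
    }


report = check_required_variables(["DOMAIN", "USUBJID", "AGE"])
print(report["missing"], report["ok"])  # ['STUDYID'] False
```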

2. check_dm_required_variables(columns)

Validates that the Demographics (DM) domain contains all required variables per SDTMIG 3.4 Table 3-1:

  • Universal (3): STUDYID, DOMAIN, USUBJID
  • DM-specific (12): SUBJID, AGE, AGEU, SEX, RACE, ETHNIC, COUNTRY, ARMCD, ARM, ACTARMCD, ACTARM, RFSTDTC

Returns: {"required", "missing", "ok"}
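The DM check can be pictured as the same membership test over a longer list. The sketch below assumes the 15 names listed above are the complete required set and that matching is case-sensitive; the constant names are hypothetical.

```python
UNIVERSAL = ["STUDYID", "DOMAIN", "USUBJID"]
DM_SPECIFIC = [
    "SUBJID", "AGE", "AGEU", "SEX", "RACE", "ETHNIC",
    "COUNTRY", "ARMCD", "ARM", "ACTARMCD", "ACTARM", "RFSTDTC",
]


def check_dm_required_variables(columns):
    """Report which of the 15 required DM variables are absent from `columns`."""
    required = UNIVERSAL + DM_SPECIFIC
    present = set(columns)
    missing = [name for name in required if name not in present]
    return {"required": required, "missing": missing, "ok": not missing}


# A dataset carrying only the universal identifiers fails on all 12 DM-specific names.
partial = check_dm_required_variables(["STUDYID", "DOMAIN", "USUBJID"])
print(len(partial["missing"]), partial["ok"])  # 12 False
```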

3. check_controlled_terminology(column, values)

Validates that values in a column conform to CDISC Controlled Terminology (CT) codelists.

Supported variables:

  • SEX (C66731): F, M, U, UNDIFFERENTIATED
  • ETHNIC (C66790): HISPANIC OR LATINO, NOT HISPANIC OR LATINO, NOT REPORTED, UNKNOWN
  • RACE (C74457): WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, AMERICAN INDIAN OR ALASKA NATIVE, NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER, MULTIPLE, NOT REPORTED, UNKNOWN
  • AGEU (C66781): YEARS, MONTHS, WEEKS, DAYS, HOURS
  • DTHFL: Y (only valid value for death flag; null/absent means no death)

Returns: {"column", "codelist_id", "valid_values", "invalid", "ok"}
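As a rough illustration, the codelists above can be held in a dictionary and checked by membership. This sketch makes several assumptions that the README does not confirm: null and empty values are skipped for every column (the README states this only for DTHFL), an unsupported column raises ValueError, and DTHFL carries no codelist ID because none is given above.

```python
# Codelist IDs and values copied from the list above; DTHFL has no ID in the README.
CODELISTS = {
    "SEX": ("C66731", ["F", "M", "U", "UNDIFFERENTIATED"]),
    "ETHNIC": ("C66790", ["HISPANIC OR LATINO", "NOT HISPANIC OR LATINO",
                          "NOT REPORTED", "UNKNOWN"]),
    "RACE": ("C74457", ["WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN",
                        "AMERICAN INDIAN OR ALASKA NATIVE",
                        "NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER",
                        "MULTIPLE", "NOT REPORTED", "UNKNOWN"]),
    "AGEU": ("C66781", ["YEARS", "MONTHS", "WEEKS", "DAYS", "HOURS"]),
    "DTHFL": (None, ["Y"]),
}


def check_controlled_terminology(column, values):
    """Collect the distinct values in `values` that are not in the column's codelist."""
    if column not in CODELISTS:
        raise ValueError(f"no codelist registered for {column!r}")  # assumed behavior
    codelist_id, valid = CODELISTS[column]
    invalid = []
    for value in values:
        if value is None or value == "":
            continue  # assumption: nulls are not terminology violations
        if value not in valid and value not in invalid:
            invalid.append(value)  # first-seen order, no duplicates
    return {
        "column": column,
        "codelist_id": codelist_id,
        "valid_values": list(valid),
        "invalid": invalid,
        "ok": not invalid,
    }


print(check_controlled_terminology("SEX", ["M", "F", "X", "X"])["invalid"])  # ['X']
```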

4. validate_dataset(dataset_name=None, dataset=None)

Runs the full validation suite in a single call and returns a combined report. Provide either a bundled sample name (see list_sample_datasets) or an inline Dataset JSON object. This is the recommended entry point for agents — it loads the data, runs all three checks (controlled terminology only for the codelist columns present), and summarizes pass/fail.

Returns: {"dataset", "checks": [...], "summary": {"ok", "passed", "failed"}}
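The orchestration described above might look roughly like the following. `build_report` is a hypothetical helper, not the server's `validate_dataset`: it takes the three checks as arguments so the sketch runs on its own, and the per-check "check" key is an assumption, since the README does not spell out the shape of each entry in "checks".

```python
def summarize(checks):
    """Collapse per-check results into the summary block."""
    passed = sum(1 for check in checks if check["ok"])
    failed = len(checks) - passed
    return {"ok": failed == 0, "passed": passed, "failed": failed}


def build_report(dataset, required_check, dm_check, ct_check, ct_columns):
    """Run both structural checks, then one CT check per codelist column present."""
    names = [col["name"] for col in dataset["columns"]]
    checks = [
        {"check": "check_required_variables", **required_check(names)},
        {"check": "check_dm_required_variables", **dm_check(names)},
    ]
    for index, name in enumerate(names):
        if name in ct_columns:
            # Rows are positional, so pull this column's values by index.
            values = [row[index] for row in dataset["rows"]]
            checks.append({"check": "check_controlled_terminology",
                           **ct_check(name, values)})
    return {"dataset": dataset.get("name"), "checks": checks,
            "summary": summarize(checks)}


# Tiny stand-in checks so the sketch is runnable without the real tools.
def _has_all(required):
    return lambda names: {"missing": [n for n in required if n not in names],
                          "ok": all(n in names for n in required)}


def _sex_check(column, values):
    invalid = [v for v in values if v not in ("F", "M", "U", "UNDIFFERENTIATED")]
    return {"column": column, "invalid": invalid, "ok": not invalid}


demo = {
    "name": "DM",
    "columns": [{"name": "STUDYID"}, {"name": "USUBJID"}, {"name": "SEX"}],
    "rows": [["S1", "S1-001", "F"], ["S1", "S1-002", "X"]],
}
report = build_report(demo, _has_all(["STUDYID", "USUBJID"]),
                      _has_all(["STUDYID", "SEX"]), _sex_check, {"SEX"})
print(report["summary"])  # {'ok': False, 'passed': 2, 'failed': 1}
```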

5. list_sample_datasets()

Lists the sample datasets bundled with the server (read live from samples/). Each entry is {"name", "label", "study", "records", "columns", "description"}; use the name with validate_dataset or the sample://<name> resource.
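One way such a live listing could be built is to scan the directory on every call. In this sketch the mapping of "study" to the file's studyOID, the use of counts for "records" and "columns", and the `descriptions` parameter are all assumptions made for illustration.

```python
import json
from pathlib import Path


def list_sample_datasets(samples_dir="samples", descriptions=None):
    """Build one summary entry per Dataset JSON file found in `samples_dir`."""
    descriptions = descriptions or {}
    entries = []
    # Re-scan on every call so edits on disk show up without a restart.
    for path in sorted(Path(samples_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        entries.append({
            "name": path.stem,  # file name without the .json suffix
            "label": data.get("label"),
            "study": data.get("studyOID"),
            "records": len(data.get("rows", [])),
            "columns": len(data.get("columns", [])),
            "description": descriptions.get(path.stem, ""),
        })
    return entries
```

Because nothing is cached, dropping a new file into the directory changes the next listing, which matches the "read live" behavior described above.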

Resources

Each bundled sample is also exposed as an MCP resource at sample://<name> (e.g. sample://pharmaverse_dm), so MCP clients can discover and load the raw Dataset JSON through the native resource primitive.
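To make the URI scheme concrete, here is a small sketch of how a sample:// URI could be mapped to a file under samples/. The helper name `sample_path` and the restriction on allowed characters are assumptions added for illustration; the server's actual resolution logic is not shown in this README.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Assumed rule: sample names are plain identifiers, which also blocks path traversal.
_SAMPLE_NAME = re.compile(r"^[A-Za-z0-9_\-]+$")


def sample_path(uri, samples_dir="samples"):
    """Translate sample://<name> into the path of samples/<name>.json."""
    parsed = urlparse(uri)
    if parsed.scheme != "sample":
        raise ValueError(f"not a sample URI: {uri!r}")
    name = parsed.netloc  # for sample://pharmaverse_dm this is "pharmaverse_dm"
    if not _SAMPLE_NAME.match(name) or parsed.path not in ("", "/"):
        raise ValueError(f"invalid sample name in {uri!r}")
    return Path(samples_dir) / f"{name}.json"


print(sample_path("sample://pharmaverse_dm").name)  # pharmaverse_dm.json
```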

End-to-End Demo: pharmaverseraw → sdtm.oak → MCP Validation

Background

The pharmaverse ecosystem provides industry-standard tools and data for learning and teaching SDTM. The pharmaverseraw R package contains the CDISCPILOT01 study — a realistic clinical trial dataset in pre-SDTM "raw EDC" format. The sdtm.oak package transforms this raw data into a valid SDTM dataset.

This MCP server completes the pipeline by validating the output:

pharmaverseraw (raw EDC data)
    ↓  sdtm.oak transformation
SDTM DM domain
    ↓  MCP validation
Validation report

Sample Data: CDISCPILOT01

samples/pharmaverse_dm.json contains 5 subjects from the CDISCPILOT01 study in Dataset JSON format (CDISC's standard JSON data representation). It is realistic data of the kind clinical programmers use to learn SDTM transformation workflows.

Subject demographics:

  • Age range: 63–77 years
  • Treatment arms: PLACEBO, Xanomeline High Dose, Xanomeline Low Dose
  • Countries: USA, Japan, Germany
  • One subject with a recorded death (DTHFL = "Y")

Running the Demo

The server owns the whole pipeline. With the server running (see Running the Server below), list and fetch samples over plain HTTP, then validate a bundled sample with a single MCP call:

# List the available samples
curl -s localhost:8000/samples

# Fetch one sample's raw Dataset JSON
curl -s localhost:8000/samples/pharmaverse_dm

# Run the full validation suite via the validate_dataset tool
curl -s localhost:8000/mcp \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call",
       "params":{"name":"validate_dataset","arguments":{"dataset_name":"pharmaverse_dm"}}}'

The validate_dataset report includes a summary ({"ok", "passed", "failed"}) plus a per-check breakdown. Try dataset_name: "dm_missing_studyid" to see check_required_variables fail on the missing STUDYID identifier.
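The same tools/call request can be issued from Python using only the standard library. This is a sketch that mirrors the curl command above and assumes the endpoint answers with a plain JSON body; if the server streams its reply (for example as server-sent events), the response parsing would need to change. The helper names are hypothetical.

```python
import json
import urllib.request


def build_tool_call(name, arguments, request_id=1):
    """Build the JSON-RPC 2.0 envelope used for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def call_tool(base_url, name, arguments):
    """POST a tools/call request to <base_url>/mcp and decode the JSON reply."""
    body = json.dumps(build_tool_call(name, arguments)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/mcp",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running locally:
# result = call_tool("http://localhost:8000", "validate_dataset",
#                    {"dataset_name": "dm_missing_studyid"})
```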

The interactive landing page at / does the same thing visually: it lists samples from /samples and renders the validate_dataset report.

Running the Server

Local Development

# Install dependencies
pip install -r requirements.txt

# Start the development server
uvicorn cdisc-mcp:app --reload --port 8000

# Landing page with interactive tool testing
open http://localhost:8000/

# MCP endpoint (for AI agents / clients)
http://localhost:8000/mcp

Configuration

Configuration is optional. One environment variable is recognized (set it in your shell or a local .env file):

  • CONNECT_SERVER — Restrict incoming connections to a specific Posit Connect hostname (via TrustedHostMiddleware). Leave unset for local development.
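A sketch of how the variable could be turned into a host allow-list. The function name is hypothetical, and accepting either a bare hostname or a full URL is an assumption; the README only says the value identifies a Posit Connect hostname.

```python
import os
from urllib.parse import urlparse


def allowed_hosts(environ=None):
    """Return a one-item host allow-list from CONNECT_SERVER, or None when unset."""
    env = os.environ if environ is None else environ
    raw = env.get("CONNECT_SERVER", "").strip()
    if not raw:
        return None  # unset: no restriction, suitable for local development
    # Prefix "//" so urlparse treats a bare hostname as the network location.
    parsed = urlparse(raw if "://" in raw else "//" + raw)
    return [parsed.hostname] if parsed.hostname else None


hosts = allowed_hosts({"CONNECT_SERVER": "https://connect.example.com/"})
print(hosts)  # ['connect.example.com']

# In a Starlette-based app the list would then be handed to the middleware, e.g.:
# if hosts:
#     app.add_middleware(TrustedHostMiddleware, allowed_hosts=hosts)
```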

Sample Datasets

The server reads sample datasets from samples/ on disk and serves them via /samples, the sample:// resources, and validate_dataset. Edit a file in samples/ and the change is reflected immediately — there is no copy baked into the front-end.

  • pharmaverse_dm.json — Realistic CDISCPILOT01 data (all 24 DM columns)
  • dm.json — Hand-crafted valid DM domain (5 subjects, common variables)
  • dm_missing_studyid.json — Test case: missing STUDYID (validates error detection)

Add another sample by dropping a Dataset JSON file into samples/; it appears automatically in the listing, the dropdown, and as a sample://<name> resource. (Optionally add a one-line description to SAMPLE_DESCRIPTIONS in cdisc-mcp.py.)

Dataset JSON Format

Datasets are represented in CDISC Dataset JSON 1.1.0 format. Example structure:

{
  "studyOID": "CDISCPILOT01",
  "name": "DM",
  "label": "Demographics",
  "columns": [
    {"name": "STUDYID", "label": "Study Identifier", ...},
    {"name": "DOMAIN", "label": "Domain Abbreviation", ...},
    ...
  ],
  "rows": [
    ["CDISCPILOT01", "DM", "01-701-1015", ...],
    ...
  ]
}

See the CDISC Dataset JSON specification for full details.
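Since rows are positional arrays, most consumers first pair each row with the column names. The helpers below are a minimal sketch of that step, using only the "columns" and "rows" keys shown in the structure above; the function names and the sample values are illustrative.

```python
def column_names(dataset):
    """Return the column names in declaration order."""
    return [column["name"] for column in dataset["columns"]]


def iter_records(dataset):
    """Yield each row as a dict keyed by column name."""
    names = column_names(dataset)
    for position, row in enumerate(dataset["rows"]):
        if len(row) != len(names):
            raise ValueError(
                f"row {position} has {len(row)} values, expected {len(names)}"
            )
        yield dict(zip(names, row))


example = {
    "studyOID": "CDISCPILOT01",
    "name": "DM",
    "columns": [{"name": "STUDYID"}, {"name": "DOMAIN"}, {"name": "USUBJID"}],
    "rows": [["CDISCPILOT01", "DM", "01-701-1015"]],
}
print(next(iter_records(example))["USUBJID"])  # 01-701-1015
```

The list returned by `column_names` is the kind of input the `columns` parameter of the required-variable tools expects.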

Architecture

All work happens in the server. cdisc-mcp.py reads the sample data from disk, owns the validation orchestration (validate_dataset), and serves both the MCP endpoint and the landing page. The front-end is a thin viewer with no data of its own.

Runtime files (everything the deployed app needs):

  • cdisc-mcp.py — FastMCP server: validation tools, validate_dataset orchestration, /samples routes, and sample:// resources; the deployment entrypoint
  • landing.html — Interactive landing page; fetches samples and results from the server at runtime
  • samples/ — Sample datasets in Dataset JSON format; read live by the server (a single source of truth)
  • requirements.txt — Python dependencies, used by Connect to build the environment

Key Design:

  • Single source of truth for data: samples/*.json on disk, read on every request — editing a sample is reflected everywhere immediately
  • Stateless HTTP service (scales on Posit Connect)
  • Hardcoded CT codelists (no external API calls — checks run fully offline)
  • Deployment-agnostic front-end (MCP and /samples URLs derived client-side from the page location)
  • Validation tools registered as plain callables, so validate_dataset and other Python code can call them directly (not over HTTP)

Deployment to Posit Connect

The server can be deployed to Posit Connect with Posit Publisher, which generates the .posit/ deployment metadata for your environment. Once deployed:

  • Set CONNECT_SERVER to the Connect hostname so TrustedHostMiddleware scopes incoming connections.
  • The server is stateless, so it scales horizontally without shared state.

The deployed bundle needs all runtime files — cdisc-mcp.py, landing.html, requirements.txt, and the samples/ directory (the server reads it at runtime). Make sure the files list in the Posit Publisher config includes samples/.
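A quick pre-deployment check can catch a missing runtime file before publishing. This helper is a hypothetical convenience script, not part of the repository; it only tests that the files named above exist and that samples/ holds at least one .json file.

```python
from pathlib import Path

# Runtime files named in the Architecture section.
RUNTIME_FILES = ["cdisc-mcp.py", "landing.html", "requirements.txt"]


def missing_runtime_files(root="."):
    """List the runtime files absent under `root`; an empty list means ready."""
    root = Path(root)
    missing = [name for name in RUNTIME_FILES if not (root / name).is_file()]
    samples = root / "samples"
    # The server reads samples/ at runtime, so an empty directory is also a problem.
    if not samples.is_dir() or not any(samples.glob("*.json")):
        missing.append("samples/*.json")
    return missing
```

Running `missing_runtime_files(".")` from the project root before deploying gives a short list to fix; it does not inspect the Posit Publisher config itself.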

Future Expansion

Possible directions for richer validation:

  • Full conformance-rule validation with CDISC CORE (CDISC Open Rules Engine)
  • Completeness checks beyond required variables
  • Domain-specific business rule validation
  • Version-specific SDTMIG checking
