edamcp

An MCP server for Exploratory Data Analysis that automatically profiles datasets, detects quality issues, PII, bias, and leakage, and provides cleaning and export workflows. It helps data scientists quickly understand and prepare unknown data for modeling.

edamcp

A general-purpose Exploratory Data Analysis MCP server. Point it at an unknown dataset and it will tell you what's really in there, what to fix first, and how to make it model-ready — quality issues, leakage, bias, PII, missingness patterns, modeling blockers, and a clean export. The hard analytical reasoning lives in the MCP, not in the agent driving it.

Works with Claude Code, Claude Desktop, Cursor, LM Studio, Open WebUI, and any other MCP-capable client. Designed to be fully usable from a local 30B-class model (Qwen 27B / Mistral Small 3.2 / GLM-4 32B) and validated end-to-end on 11M-row production data.


Quick start

# clone + run from a checkout
git clone https://github.com/charliecpeterson/edamcp.git
cd edamcp
uv run edamcp                  # full surface (70 tools)
uv run edamcp --mode local     # thin surface (35 tools) for local models

Wire it into your MCP client. The location of the client's config file varies: Claude Desktop uses ~/Library/Application Support/Claude/claude_desktop_config.json (on macOS), LM Studio uses ~/.lmstudio/mcp.json, and Cursor exposes it under Settings → MCP. The entry itself has the same shape everywhere:

{
  "mcpServers": {
    "edamcp": {
      "command": "/path/to/uv",
      "args": [
        "run", "--directory", "/path/to/edamcp/checkout",
        "edamcp"
      ]
    }
  }
}

For local-model setups (LM Studio + Qwen, Ollama with the bridge, etc.) add "--mode", "local" to the args — see Local mode.
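
If you maintain several client configs, the entry can also be generated rather than hand-edited. The following is a minimal sketch, not part of edamcp: the helper name edamcp_entry is hypothetical, and the paths are the same placeholders used in the JSON above. It only builds the entry; merging it into an existing config file is left to you.

```python
import json


def edamcp_entry(uv_path: str, checkout_dir: str, local_mode: bool = False) -> dict:
    """Build the mcpServers entry from the README (hypothetical helper).

    With local_mode=True, "--mode", "local" is appended to the args,
    which selects the thin tool surface for local models.
    """
    args = ["run", "--directory", checkout_dir, "edamcp"]
    if local_mode:
        args += ["--mode", "local"]
    return {"mcpServers": {"edamcp": {"command": uv_path, "args": args}}}


# Placeholder paths: replace with your real uv binary and checkout location.
config = edamcp_entry("/path/to/uv", "/path/to/edamcp/checkout", local_mode=True)
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into whichever config file your client reads.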


What you can do with it

A typical agent session — one user message, one or two tool calls per question, paste-ready Markdown back:

"Use edamcp on /data/orders.csv. Tell me the three things I should fix first and whether I can train a model predicting is_returned."

→ Agent calls auto_explore then auto_modeling_audit. edamcp returns ranked critical issues, suggested cleaning steps, modeling blockers (class imbalance, leakage, multicollinearity), and pre-filled SQL drill-downs.

"This is taxi data, 11 million rows of dirty parquet. Find the worst problems, propose a cleaning plan, apply it, save the result, and tell me what would leak if I trained a tip-prediction model."

→ One agent run, ~15 tool calls, ends with a 7.8M-row cleaned parquet on disk and a flagged target-leakage warning (total_amount = fare + tip, with VIF 1,164). Both the macro work and the heavy modeling audit auto-sample to ~500K rows, so no call exceeds the 30-second MCP transport timeout.
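
To put the VIF figure in perspective: under the standard definition, VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. The short sketch below inverts that formula. The function name is hypothetical and this is plain arithmetic, not edamcp code.

```python
def r_squared_from_vif(vif: float) -> float:
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary-regression R^2."""
    if vif < 1:
        raise ValueError("VIF is always >= 1")
    return 1.0 - 1.0 / vif


r2 = r_squared_from_vif(1164)
print(f"R^2 = {r2:.5f}")  # prints "R^2 = 0.99914"
```

A VIF of 1,164 therefore means the other columns explain about 99.9% of that feature's variance, which is what you would expect when one column is an arithmetic sum of others.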

What edamcp surfaces

Everything below is detected automatically, with no specific instruction from the agent:

  • Schema quality — constants, all-null columns, sentinel values (-999, "NA", etc.) masquerading as data, type drift across rows.
  • Missingness patterns — including structurally-coupled nulls (e.g. five columns all NULL in the exact same rows).
  • Distribution shape — skew, multimodality, zero-inflation, heavy tails; with suggested transforms (log, Yeo-Johnson, reflect+log).
  • Correlations, multicollinearity (VIF), correlated-feature groups.
  • Duplicates — exact + by-key, with dedupe SQL.
  • Temporal — gaps, monotonicity, drift, daily-count seasonality.
  • Target leakage — name heuristics + correlation + temporal-constant detection.
  • Bias & fairness — class imbalance, sampling bias, 80%-rule disparate impact.
  • PII — regex + Luhn-validated card numbers + column-name heuristics; samples returned redacted.
  • Bootstrap stability + signal-to-noise per feature.
  • PCA dimensionality + Hopkins clustering tendency.
  • Text columns — length, vocab, near-duplicate %, mojibake.
  • Geospatial — lat/lon validation (out-of-range, null island, bbox).
  • Nested JSON — flattens deep structures, leaf-presence %, type drift.
  • HDF5 scientific arrays — leaf-name aggregation, fill values, NaN/Inf, dtype drift, valid_range violations. Validated on real ANI-1 dataset (3.5 GB, 47K molecule groups).
  • Cross-source schema diff + KS-test / PSI distribution drift.
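
As an illustration of the "regex + Luhn-validated card numbers" item above, here is a minimal sketch of that technique. The README does not show edamcp's actual detector, so the regular expression, the helper names, and the redaction format (mask all but the last four digits) are assumptions made for this example.

```python
import re

# Assumption: a candidate is 13-19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(number: str) -> str:
    """Mask every digit except the last four."""
    digits = [c for c in number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def find_cards(text: str) -> list:
    """Return redacted forms of Luhn-valid card-like numbers found in `text`."""
    return [redact(m.group()) for m in CARD_CANDIDATE.finditer(text)
            if luhn_valid(m.group())]


hits = find_cards("card 4111 1111 1111 1111 on file, order 1234567890123456")
print(hits)
```

The checksum step is what keeps ordinary 16-digit identifiers, such as the order number in the example, from being reported as card numbers.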

Plus a full cleaning + export workflow (auto_clean → clean_pipeline → export_source), eight plot tools with inline image rendering, an interactive HTML report, and YAML plugins for domain-specific checks (e.g. compliance rules, chemistry valence) that load on startup.


Tool surface (70 tools, organized into nine groups)

Source management · load_source · list_sources · describe_source · detect_pattern · detect_metadata · fingerprint_source · infer_recipes · unload_source · list_files

Query · run_sql · sample_rows · peek_array (HDF5 slice reader)

EDA checks · profile · check_quality · check_distributions · check_correlations · check_duplicates · check_multicollinearity · check_temporal · check_leakage · check_bias · check_pii · check_dimensionality · check_stability · check_text_columns · check_outliers · check_feature_signal · check_arrays · check_geospatial · check_nested_structure · check_custom (plugin runner) · suggest_plots · recommend_tasks

Orchestration · run_eda (presets: quick / standard / deep / exhaustive) · get_eda_findings

Visualization · plot_distribution · plot_correlation_heatmap · plot_scatter · plot_timeseries · plot_boxplot · plot_missingness · plot_pair · plot_qq · plot_violin · plot_facet · eda_storyboard (7-plot guided tour) · generate_report (interactive HTML)

Comparison · compare_sources · compare_groups · detect_schema_evolution

Synthesis · data_card · summarize_run · recommend_next

Cleaning + export · auto_clean (proposes plan) · clean_pipeline (executes + materializes) · clean_drop_columns · clean_rename_columns · clean_cast · clean_replace · clean_filter · clean_drop_duplicates · clean_impute · clean_transform · export_source (parquet / csv / json / ndjson)

Macros (one-call workflows) · auto_explore · auto_quality · auto_modeling_audit · auto_share_check · auto_compare


Design highlights

  • DuckDB views as the substrate. Every loaded source is a DuckDB view in a single in-memory connection. Cross-format JOINs (CSV ↔ Parquet ↔ HDF5 metadata) work for free; large data is out-of-core; no copies until export.
  • Materialized clean output. clean_pipeline collapses its final view into a real BASE TABLE (CTAS) so downstream queries scan it once instead of re-running the whole filter chain — 10× speedup on every follow-up query.
  • Scale-aware sampling. Heavy tools (auto_modeling_audit, suggest_plots, eda_storyboard, check_temporal, check_dimensionality) auto-build a reservoir sample (default 500K rows) when the source is larger. Statistics converge well before that threshold; without sampling, a 7.8M-row source takes ~3.5 minutes per macro call and risks OOM-killing the MCP. With sampling: ~9 seconds, same verdict.
  • Thick tools, thin agent. Macros chain 4–8 granular checks into one call and return Markdown-first output. A 30B local model picking from 35 tools is reliable; picking from 70 is not. See Local mode.
  • Findings are severity-sorted, top-K by default. Every check returns top_findings[≤10] + an artifact_path to a JSON with the full results. Context budget is the scarcest resource.
  • Reproducibility recipe in every result. Each non-trivial tool returns the equivalent SQL (and where useful, Polars/Pandas/Python) so the user can replicate without the MCP.
  • SQL injection-safe. All identifier interpolations go through sqlsafe.quote_ident() (doubles embedded "), so column names with awkward characters work and adversarial inputs don't break out of quoting.
  • Thread-safe DuckDB access. FastMCP dispatches synchronous tool handlers on a thread pool; Session._lock serializes execute/sql so concurrent tool calls can't corrupt session state.
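
The scale-aware sampling item above relies on reservoir sampling, which draws a fixed-size uniform sample in a single pass without knowing the row count in advance. The sketch below is the textbook Algorithm R in pure Python, shown only to illustrate the idea. The README does not say how edamcp builds its sample (DuckDB has its own sampling support, which the server may well use), and the function name and seed parameter are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniformly sample up to k items from a stream in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(range(1_000_000), k=1_000)
print(len(sample))
```

Memory stays proportional to k regardless of source size, which is why a fixed 500K-row cap keeps heavy tools bounded on multi-million-row inputs.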

Local mode

For 30B-class models, the 70-tool surface is too wide. Run with --mode local to expose 35 high-level tools:

  • Source/query basics: load_source, list_sources, list_files, run_sql, sample_rows, unload_source
  • 5 macros: auto_explore, auto_quality, auto_modeling_audit, auto_share_check, auto_compare
  • Modality-specific entry points: check_arrays, peek_array, check_geospatial, check_nested_structure, check_custom
  • Analyst-question tools: suggest_plots, check_outliers, compare_groups, check_feature_signal, recommend_tasks, eda_storyboard, generate_report
  • Cleaning + export: auto_clean, clean_pipeline, clean_drop_columns, clean_replace, clean_cast, clean_drop_duplicates, clean_impute, clean_filter, export_source
  • Synthesis: data_card, summarize_run, recommend_next

The macros internally chain the granular tools and return Markdown. Validated against Qwen 27B in LM Studio — completes a full 5-question EDA prompt on 11M-row data without timeouts.


Prompts (slash-commands)

Hosts that surface prompts as a menu (Claude Desktop, Cursor) get eight pre-canned templates that nudge the model toward the right workflow:

| Prompt | Args | What it does |
| --- | --- | --- |
| unknown_data | path | "I have no idea what this is." Calls auto_explore, summarizes. |
| eda_starter | source_id | Standard EDA walkthrough with plain-English explanations. |
| data_audit | source_id | Critical-issue-only audit. Skips info-severity noise. |
| pre_modeling_check | source_id, target_column | Modeling-readiness verdict + hard blockers + soft mitigations. |
| data_provenance | source_id | Hypothesizes about origin, population, missing pieces. |
| compare_datasets | source_a, source_b | Drop-in compatibility check. |
| multi_file_starter | path | Pattern-detect → load → data_card. |
| share_safety_check | source_id | PII + fairness gate before sharing externally. |

Resources (read-only handles)

  • eda://sources — list of loaded sources
  • eda://sources/{source_id} — schema + metadata for one source
  • eda://runs/{run_id} — full JSON artifact from a prior run_eda
  • eda://plots/{plot_id} — Vega-Lite spec + paths to rendered PNG/SVG
  • eda://plugins — list of registered YAML plugins

Plugins (YAML + SQL)

Drop a YAML file into a directory and pass --plugins /path/to/dir (repeatable). Bundled defaults under plugins_builtin/ load automatically.

Example — age_sanity.yaml:

name: age_sanity
description: Detects rows with biologically implausible ages
category: domain_violation
severity: critical
applies_to:
  has_columns: [age]
  modality: tabular
sql: |
  SELECT
    'age out of range' AS title,
    'age' AS column,
    COUNT(*) AS rows_affected,
    MIN(CAST(age AS BIGINT)) AS min_age_seen,
    MAX(CAST(age AS BIGINT)) AS max_age_seen
  FROM {source}
  WHERE age IS NOT NULL AND (CAST(age AS BIGINT) < 0 OR CAST(age AS BIGINT) > 130)
  HAVING COUNT(*) > 0
interpretation: Ages outside [0, 130] are biologically implausible; usually they're sentinel values.
suggested_action: "Set to NULL: UPDATE source SET age = NULL WHERE age < 0 OR age > 130."

The SQL must reference the source via the literal {source} placeholder (substituted with the quoted alias at runtime). Each row the SQL returns becomes a Finding. Plugins self-skip when applies_to.has_columns doesn't match the source schema, so you can ship many and only the relevant ones run.
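
The two mechanics described above, self-skipping on has_columns and substituting {source} with a quoted alias, can be sketched as follows. This is an illustration under stated assumptions, not edamcp's code: quote_ident here is a stand-in that reimplements the documented behavior of doubling embedded double quotes, plugin_applies and render_plugin_sql are hypothetical names, and exact, case-sensitive column matching is assumed.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def plugin_applies(required_columns, source_columns) -> bool:
    """A plugin runs only if every required column exists in the source."""
    return set(required_columns) <= set(source_columns)


def render_plugin_sql(sql_template: str, source_alias: str) -> str:
    """Substitute the literal {source} placeholder with the quoted alias.

    str.replace is used instead of str.format so that other braces in
    the SQL are left untouched.
    """
    return sql_template.replace("{source}", quote_ident(source_alias))


template = "SELECT COUNT(*) AS rows_affected FROM {source} WHERE age > 130"
schema = ["id", "age", "city"]

if plugin_applies(["age"], schema):
    print(render_plugin_sql(template, 'patients "2024"'))
```

Because the alias is quoted before substitution, a source name containing spaces or quotes still yields a single valid identifier.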

Plugins run automatically as part of run_eda (any preset) and via the standalone check_custom tool. They appear in data_card output alongside built-in checks.


Architecture

┌────────────────────────────────────────────────────────────────┐
│  MCP client (Claude Code, Cursor, Claude Desktop, LM Studio…)  │
└─────────────────────────┬──────────────────────────────────────┘
                          │ stdio JSON-RPC (MCP)
            ┌─────────────▼─────────────┐
            │     edamcp (Python)       │
            │  ─ tool dispatch          │
            │  ─ session (DuckDB views) │
            │  ─ thread-safe lock       │
            │  ─ artifact cache (./_eda)│
            └─────────────┬─────────────┘
                          │
   ┌──────────────────────┼────────────────────────────┐
   │                      │                            │
   ▼                      ▼                            ▼
┌─────────┐       ┌────────────────┐        ┌──────────────────┐
│ ingest  │       │ query engine   │        │ eda checks       │
│ csv/pq/ │◄─────►│ DuckDB views   │◄──────►│ + viz + synth    │
│ json/h5 │       │ + sample tables│        │ + cleaning ops   │
└─────────┘       └────────────────┘        │ + plugins        │
                                            └──────────────────┘
  • DuckDB is the engine. Every loaded source becomes a DuckDB view in a single connection; cleaning pipelines materialize their final result as a real table.
  • Polars for in-memory ops where convenient.
  • Altair + vl-convert for plots — returns Vega-Lite JSON (so the agent can reason about / modify) AND a rendered PNG (for chat UIs).
  • scipy for statistical tests (chi-square, KS, Mann-Whitney, ANOVA); numpy for PCA via SVD.
  • h5py for HDF5 inspection with chunked streaming reductions.

Every tool can be orchestrated statelessly: each call returns either a small summary card with top_findings[≤10] and an artifact_path, or paste-ready Markdown.
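
A summary card of that shape could be assembled roughly as below. Only top_findings and artifact_path come from the README; the severity ordering (with "warning" as an assumed middle level), the total_findings field, the findings.json file name, and the function name are all assumptions for this sketch.

```python
import json
import tempfile
from pathlib import Path

# Assumed ordering; the README names "critical" and "info" severities.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def summary_card(findings, artifact_dir, top_k=10):
    """Write all findings to a JSON artifact and return a small top-K card."""
    ordered = sorted(
        findings,
        key=lambda f: SEVERITY_RANK.get(f["severity"], len(SEVERITY_RANK)),
    )
    artifact_path = Path(artifact_dir) / "findings.json"
    artifact_path.write_text(json.dumps(ordered, indent=2))
    return {
        "top_findings": ordered[:top_k],
        "total_findings": len(ordered),
        "artifact_path": str(artifact_path),
    }


# Fifteen synthetic findings: five each of info, critical, and warning.
findings = [
    {"title": f"finding {i}", "severity": sev}
    for i, sev in enumerate(["info", "critical", "warning"] * 5)
]

with tempfile.TemporaryDirectory() as tmp:
    card = summary_card(findings, tmp)
    full = json.loads(Path(card["artifact_path"]).read_text())

print(len(card["top_findings"]), "of", card["total_findings"], "findings in the card")
```

The card stays small enough for a limited context window, while the artifact on disk keeps the complete result available for later drill-down.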


Development

# clone + install
git clone https://github.com/charliecpeterson/edamcp.git
cd edamcp
uv sync

# generate the synthetic test datasets
uv run python scripts/generate_test_data.py ~/edamcp-testdata

# run the full smoke suite
EDAMCP_TEST_DATA=~/edamcp-testdata uv run python scripts/smoke_test.py

# verify local-mode tool surface
uv run python scripts/smoke_local_mode.py

The smoke tests exercise every tool end-to-end through an stdio MCP handshake. Set EDAMCP_ANI1_PATH=/path/to/ANI-1_release/ to additionally exercise the HDF5 scientific-array path against real data.
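
For readers unfamiliar with what a stdio MCP handshake involves, the sketch below builds the first three messages a client typically sends: an initialize request, the initialized notification, and a tools/list request, each serialized as one line of JSON. It is not taken from scripts/smoke_test.py, which may well use an MCP client library instead. The client name and version are placeholders, and the protocol version string is one published revision chosen for illustration.

```python
import json


def jsonrpc_request(req_id, method, params=None):
    """Serialize a JSON-RPC 2.0 request as a single line."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)


def jsonrpc_notification(method):
    """Serialize a JSON-RPC 2.0 notification (no id, so no reply is expected)."""
    return json.dumps({"jsonrpc": "2.0", "method": method})


handshake = [
    jsonrpc_request(1, "initialize", {
        "protocolVersion": "2024-11-05",  # illustrative; newer revisions exist
        "capabilities": {},
        "clientInfo": {"name": "smoke-client", "version": "0.0.0"},  # placeholders
    }),
    jsonrpc_notification("notifications/initialized"),
    jsonrpc_request(2, "tools/list"),
]

for line in handshake:
    print(line)
```

Over stdio, each line would be written to the server's stdin, and the reply to tools/list is where a test could check that the expected tool surface is exposed.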


Status

Production-validated:

  • Three months of NYC Yellow Taxi parquet (11M rows, 20 cols, 190 MB) through the full pipeline: load → audit → clean → export → modeling audit → storyboard. Completes in ~15 tool calls under 32K context with Qwen 27B in LM Studio.
  • ANI-1 scientific dataset (3.5 GB HDF5, 47,934 molecule groups, ~288K leaf datasets). Full analysis in ~50 seconds; peek_array reads slices by leaf name with |S1 SMILES auto-joined to readable strings.
  • Synthetic dirty e-commerce with curated quality issues — catches constants, all-nulls, sentinels, type drift, duplicates, multicollinearity, bias, PII, leakage, all in one data_card call.

Smoke coverage: 100+ assertions across 70 tools, both modes, plugins, prompts, resources, and the end-to-end cleaning pipeline.


License

MIT — see LICENSE.
