edamcp
An MCP server for Exploratory Data Analysis that automatically profiles datasets, detects quality issues, PII, bias, and leakage, and provides cleaning and export workflows. It helps data scientists quickly understand and prepare unknown data for modeling.
A general-purpose Exploratory Data Analysis MCP server. Point it at an unknown dataset and it will tell you what's really in there, what to fix first, and how to make it model-ready — quality issues, leakage, bias, PII, missingness patterns, modeling blockers, and a clean export. The hard analytical reasoning lives in the MCP, not in the agent driving it.
Works with Claude Code, Claude Desktop, Cursor, LM Studio, Open WebUI, and any other MCP-capable client. Designed to be fully usable from a local 30B-class model (Qwen 27B / Mistral Small 3.2 / GLM-4 32B) and validated end-to-end on 11M-row production data.
Quick start
# clone + run from a checkout
git clone https://github.com/charliecpeterson/edamcp.git
cd edamcp
uv run edamcp # full surface (70 tools)
uv run edamcp --mode local # thin surface (35 tools) for local models
Wire it into your MCP client. The exact path to mcp.json varies — Claude
Desktop is ~/Library/Application Support/Claude/claude_desktop_config.json,
LM Studio is ~/.lmstudio/mcp.json, Cursor is in Settings → MCP. The entry
itself is the same shape:
{
  "mcpServers": {
    "edamcp": {
      "command": "/path/to/uv",
      "args": [
        "run", "--directory", "/path/to/edamcp/checkout",
        "edamcp"
      ]
    }
  }
}
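If the client config already lists other servers, the entry can be merged in rather than pasted by hand. The following is a minimal sketch, assuming the config file is plain JSON with a top-level mcpServers key as in the snippet above. The helper name add_edamcp_entry and its parameters are illustrative and not part of edamcp; the local_mode flag simply appends "--mode", "local" for the thin tool surface.

```python
import json
from pathlib import Path


def add_edamcp_entry(config_path, uv_path, checkout_dir, local_mode=False):
    """Merge an edamcp entry into an MCP client config, keeping other servers."""
    path = Path(config_path)
    # Start from the existing config if there is one, else an empty document.
    config = json.loads(path.read_text()) if path.exists() else {}

    args = ["run", "--directory", str(checkout_dir), "edamcp"]
    if local_mode:
        # Thin 35-tool surface intended for local models.
        args += ["--mode", "local"]

    # setdefault keeps any servers that are already configured.
    config.setdefault("mcpServers", {})["edamcp"] = {
        "command": str(uv_path),
        "args": args,
    }
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

A call might look like add_edamcp_entry(Path.home() / ".lmstudio" / "mcp.json", "/path/to/uv", "/path/to/edamcp/checkout", local_mode=True). Back up the config first, since the file is rewritten in place.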
For local-model setups (LM Studio + Qwen, Ollama with the bridge, etc.) add
"--mode", "local" to the args — see Local mode.
What you can do with it
A typical agent session — one user message, one or two tool calls per question, paste-ready Markdown back:
"Use edamcp on /data/orders.csv. Tell me the three things I should fix first and whether I can train a model predicting is_returned."
→ Agent calls auto_explore then auto_modeling_audit. edamcp returns
ranked critical issues, suggested cleaning steps, modeling blockers (class
imbalance, leakage, multicollinearity), and pre-filled SQL drill-downs.
"This is taxi data, 11 million rows of dirty parquet. Find the worst problems, propose a cleaning plan, apply it, save the result, and tell me what would leak if I trained a tip-prediction model."
→ One agent run, ~15 tool calls, ends with a 7.8M-row cleaned parquet on
disk and a flagged target-leakage warning (total_amount = fare + tip
with VIF 1,164). Both the macro work and the heavy modeling audit auto-sample
to ~500K rows so the 30-second MCP transport never times out.
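The auto-sampling mentioned here is what keeps latency bounded on large sources. One standard way to draw a fixed-size uniform sample in a single pass is reservoir sampling (Algorithm R). The sketch below illustrates the idea only; it is not taken from edamcp, whose implementation may differ, and the function name and the tiny k are assumptions for the demo.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return up to k items drawn uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5)
print(len(sample))
```

Memory use depends only on k, not on the number of rows scanned, which is why the same approach works for a 500K-row sample over an 11M-row source.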
What edamcp surfaces
Everything below is detected automatically, with no specific instruction from the agent:
- Schema quality — constants, all-null columns, sentinel values (-999, "NA", etc.) masquerading as data, type drift across rows.
- Missingness patterns — including structurally-coupled nulls (e.g. five columns all NULL in the exact same rows).
- Distribution shape — skew, multimodality, zero-inflation, heavy tails; with suggested transforms (log, Yeo-Johnson, reflect+log).
- Correlations, multicollinearity (VIF), correlated-feature groups.
- Duplicates — exact + by-key, with dedupe SQL.
- Temporal — gaps, monotonicity, drift, daily-count seasonality.
- Target leakage — name heuristics + correlation + temporal-constant detection.
- Bias & fairness — class imbalance, sampling bias, 80%-rule disparate impact.
- PII — regex + Luhn-validated card numbers + column-name heuristics; samples returned redacted.
- Bootstrap stability + signal-to-noise per feature.
- PCA dimensionality + Hopkins clustering tendency.
- Text columns — length, vocab, near-duplicate %, mojibake.
- Geospatial — lat/lon validation (out-of-range, null island, bbox).
- Nested JSON — flattens deep structures, leaf-presence %, type drift.
- HDF5 scientific arrays — leaf-name aggregation, fill values, NaN/Inf, dtype drift, valid_range violations. Validated on real ANI-1 dataset (3.5 GB, 47K molecule groups).
- Cross-source schema diff + KS-test / PSI distribution drift.
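To make the last item concrete: the Population Stability Index compares binned proportions of a baseline sample e and a new sample a as the sum over bins of (a_i - e_i) * ln(a_i / e_i). The sketch below is an illustration under stated assumptions, not edamcp's implementation: it uses equal-width bins derived from the baseline (quantile bins are also common), clamps out-of-range values into the edge bins, and floors empty bins at a small epsilon to avoid log(0).

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a new sample."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; the thresholds edamcp applies are not stated here.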
Plus a full cleaning + export workflow (auto_clean → clean_pipeline →
export_source), eight plot tools with inline image rendering, an interactive
HTML report, and YAML plugins for domain-specific checks (e.g. compliance
rules, chemistry valence) that load on startup.
Tool surface (70 tools, organized into eight groups)
Source management · load_source · list_sources · describe_source · detect_pattern · detect_metadata · fingerprint_source · infer_recipes · unload_source · list_files
Query · run_sql · sample_rows · peek_array (HDF5 slice reader)
EDA checks · profile · check_quality · check_distributions · check_correlations · check_duplicates · check_multicollinearity · check_temporal · check_leakage · check_bias · check_pii · check_dimensionality · check_stability · check_text_columns · check_outliers · check_feature_signal · check_arrays · check_geospatial · check_nested_structure · check_custom (plugin runner) · suggest_plots · recommend_tasks
Orchestration · run_eda (presets: quick / standard / deep / exhaustive) · get_eda_findings
Visualization · plot_distribution · plot_correlation_heatmap · plot_scatter · plot_timeseries · plot_boxplot · plot_missingness · plot_pair · plot_qq · plot_violin · plot_facet · eda_storyboard (7-plot guided tour) · generate_report (interactive HTML)
Comparison · compare_sources · compare_groups · detect_schema_evolution
Synthesis · data_card · summarize_run · recommend_next
Cleaning + export · auto_clean (proposes plan) · clean_pipeline (executes + materializes) · clean_drop_columns · clean_rename_columns · clean_cast · clean_replace · clean_filter · clean_drop_duplicates · clean_impute · clean_transform · export_source (parquet / csv / json / ndjson)
Macros (one-call workflows) · auto_explore · auto_quality · auto_modeling_audit · auto_share_check · auto_compare
Design highlights
- DuckDB views as the substrate. Every loaded source is a DuckDB view in a single in-memory connection. Cross-format JOINs (CSV ↔ Parquet ↔ HDF5 metadata) work for free; large data is out-of-core; no copies until export.
- Materialized clean output. clean_pipeline collapses its final view into a real BASE TABLE (CTAS) so downstream queries scan it once instead of re-running the whole filter chain — 10× speedup on every follow-up query.
- Scale-aware sampling. Heavy tools (auto_modeling_audit, suggest_plots, eda_storyboard, check_temporal, check_dimensionality) auto-build a reservoir sample (default 500K rows) when the source is larger. Statistics converge well before that threshold; without sampling, a 7.8M-row source takes ~3.5 minutes per macro call and risks OOM-killing the MCP. With sampling: ~9 seconds, same verdict.
- Thick tools, thin agent. Macros chain 4–8 granular checks into one call and return Markdown-first output. A 30B local model picking from 35 tools is reliable; picking from 70 is not. See Local mode.
- Findings are severity-sorted, top-K by default. Every check returns top_findings[≤10] + an artifact_path to a JSON with the full results. Context budget is the scarcest resource.
- Reproducibility recipe in every result. Each non-trivial tool returns the equivalent SQL (and where useful, Polars/Pandas/Python) so the user can replicate without the MCP.
- SQL injection-safe. All identifier interpolations go through sqlsafe.quote_ident() (doubles embedded "), so column names with awkward characters work and adversarial inputs don't break out of quoting.
- Thread-safe DuckDB access. FastMCP dispatches synchronous tool handlers on a thread pool; Session._lock serializes execute/sql so concurrent tool calls can't corrupt session state.
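The identifier-quoting and locking points can be illustrated together. This is a minimal sketch, not edamcp's actual code: quote_ident mirrors the behavior described for sqlsafe.quote_ident() (wrap in double quotes, double any embedded ones), and the Session class shows one way a lock can serialize access to a shared connection. The stdlib sqlite3 module stands in for DuckDB so the example is self-contained.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


class Session:
    """Shared connection whose execute() is serialized by a lock."""

    def __init__(self):
        # check_same_thread=False lets worker threads reuse the connection;
        # the lock below is what makes that safe.
        self._conn = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()

    def execute(self, sql, params=()):
        with self._lock:
            return self._conn.execute(sql, params).fetchall()


session = Session()
col = 'total "net" amount'  # awkward but legal column name
session.execute(f"CREATE TABLE t ({quote_ident(col)} REAL)")


def insert(value):
    return session.execute("INSERT INTO t VALUES (?)", (value,))


# Concurrent inserts from a thread pool, as a tool dispatcher might issue them.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(insert, range(100)))

print(session.execute(f"SELECT COUNT(*) FROM t WHERE {quote_ident(col)} >= 0"))
```

Values still go through bound parameters (the ? placeholders); quoting is only needed for identifiers, which cannot be parameterized.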
Local mode
For 30B-class models, the 70-tool surface is too wide. Run with
--mode local to expose 35 high-level tools:
- Source/query basics: load_source, list_sources, list_files, run_sql, sample_rows, unload_source
- 5 macros: auto_explore, auto_quality, auto_modeling_audit, auto_share_check, auto_compare
- Modality-specific entry points: check_arrays, peek_array, check_geospatial, check_nested_structure, check_custom
- Analyst-question tools: suggest_plots, check_outliers, compare_groups, check_feature_signal, recommend_tasks, eda_storyboard, generate_report
- Cleaning + export: auto_clean, clean_pipeline, clean_drop_columns, clean_replace, clean_cast, clean_drop_duplicates, clean_impute, clean_filter, export_source
- Synthesis: data_card, summarize_run, recommend_next
The macros internally chain the granular tools and return Markdown. Validated against Qwen 27B in LM Studio — completes a full 5-question EDA prompt on 11M-row data without timeouts.
Prompts (slash-commands)
Hosts that surface prompts as a menu (Claude Desktop, Cursor) get eight pre-canned templates that nudge the model toward the right workflow:
| Prompt | Args | What it does |
|---|---|---|
| unknown_data | path | "I have no idea what this is." Calls auto_explore, summarizes. |
| eda_starter | source_id | Standard EDA walkthrough with plain-English explanations. |
| data_audit | source_id | Critical-issue-only audit. Skips info-severity noise. |
| pre_modeling_check | source_id, target_column | Modeling-readiness verdict + hard blockers + soft mitigations. |
| data_provenance | source_id | Hypothesizes about origin, population, missing pieces. |
| compare_datasets | source_a, source_b | Drop-in compatibility check. |
| multi_file_starter | path | Pattern-detect → load → data_card. |
| share_safety_check | source_id | PII + fairness gate before sharing externally. |
Resources (read-only handles)
- eda://sources — list of loaded sources
- eda://sources/{source_id} — schema + metadata for one source
- eda://runs/{run_id} — full JSON artifact from a prior run_eda
- eda://plots/{plot_id} — Vega-Lite spec + paths to rendered PNG/SVG
- eda://plugins — list of registered YAML plugins
Plugins (YAML + SQL)
Drop a YAML file into a directory and pass --plugins /path/to/dir
(repeatable). Bundled defaults under plugins_builtin/ load automatically.
Example — age_sanity.yaml:
name: age_sanity
description: Detects rows with biologically implausible ages
category: domain_violation
severity: critical
applies_to:
  has_columns: [age]
  modality: tabular
sql: |
  SELECT
    'age out of range' AS title,
    'age' AS column,
    COUNT(*) AS rows_affected,
    MIN(CAST(age AS BIGINT)) AS min_age_seen,
    MAX(CAST(age AS BIGINT)) AS max_age_seen
  FROM {source}
  WHERE age IS NOT NULL AND (CAST(age AS BIGINT) < 0 OR CAST(age AS BIGINT) > 130)
  HAVING COUNT(*) > 0
interpretation: Ages outside [0, 130] are biologically implausible; usually they're sentinel values.
suggested_action: "Cast to NULL: UPDATE source SET age = NULL WHERE age < 0 OR age > 130."
The SQL must reference the source via the literal {source} placeholder
(substituted with the quoted alias at runtime). Each row the SQL returns
becomes a Finding. Plugins self-skip when applies_to.has_columns doesn't
match the source schema, so you can ship many and only the relevant ones
run.
Plugins run automatically as part of run_eda (any preset) and via the
standalone check_custom tool. They appear in data_card output alongside
built-in checks.
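The plugin contract described above (self-skip on missing columns, {source} substitution, one Finding per returned row) can be sketched in a few lines. This is an illustration under assumptions, not edamcp's loader: the plugin is shown as an already-parsed dict, run_plugin and the finding shape are hypothetical, the stdlib sqlite3 module stands in for DuckDB, and the SQL is a simplified variant of the YAML example.

```python
import sqlite3

# A plugin as it might look after parsing the YAML (hypothetical shape).
plugin = {
    "name": "age_sanity",
    "severity": "critical",
    "applies_to": {"has_columns": ["age"]},
    "sql": (
        "SELECT * FROM ("
        " SELECT 'age out of range' AS title, COUNT(*) AS rows_affected"
        " FROM {source} WHERE age < 0 OR age > 130"
        ") WHERE rows_affected > 0"
    ),
}


def quote_ident(name):
    """Quote an identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def run_plugin(conn, plugin, source):
    """Return one finding dict per row the plugin SQL yields, or [] if skipped."""
    probe = conn.execute(f"SELECT * FROM {quote_ident(source)} LIMIT 0")
    columns = {d[0] for d in probe.description}
    if not set(plugin["applies_to"]["has_columns"]) <= columns:
        return []  # self-skip: required columns are missing
    # Substitute the literal {source} placeholder with the quoted alias.
    cur = conn.execute(plugin["sql"].replace("{source}", quote_ident(source)))
    names = [d[0] for d in cur.description]
    return [
        {"plugin": plugin["name"], "severity": plugin["severity"], **dict(zip(names, row))}
        for row in cur
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?)", [(34,), (-999,), (212,), (51,)])
conn.execute("CREATE TABLE orders (amount REAL)")

print(run_plugin(conn, plugin, "people"))  # one finding: two rows out of range
print(run_plugin(conn, plugin, "orders"))  # skipped: no age column
```

Using str.replace rather than str.format for the placeholder avoids clashes with any other braces that might appear in a plugin's SQL.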
Architecture
┌────────────────────────────────────────────────────────────────┐
│  MCP client (Claude Code, Cursor, Claude Desktop, LM Studio…)  │
└─────────────────────────┬──────────────────────────────────────┘
                          │ stdio JSON-RPC (MCP)
            ┌─────────────▼─────────────┐
            │  edamcp (Python)          │
            │  ─ tool dispatch          │
            │  ─ session (DuckDB views) │
            │  ─ thread-safe lock       │
            │  ─ artifact cache (./_eda)│
            └─────────────┬─────────────┘
                          │
   ┌──────────────────────┼────────────────────────────┐
   │                      │                            │
   ▼                      ▼                            ▼
┌─────────┐       ┌────────────────┐        ┌──────────────────┐
│ ingest  │       │ query engine   │        │ eda checks       │
│ csv/pq/ │◄─────►│ DuckDB views   │◄──────►│ + viz + synth    │
│ json/h5 │       │ + sample tables│        │ + cleaning ops   │
└─────────┘       └────────────────┘        │ + plugins        │
                                            └──────────────────┘
- DuckDB is the engine. Every loaded source becomes a DuckDB view in a single connection; cleaning pipelines materialize their final result as a real table.
- Polars for in-memory ops where convenient.
- Altair + vl-convert for plots — returns Vega-Lite JSON (so the agent can reason about / modify) AND a rendered PNG (for chat UIs).
- scipy for statistical tests (chi-square, KS, Mann-Whitney, ANOVA); numpy for PCA via SVD.
- h5py for HDF5 inspection with chunked streaming reductions.
Everything is statelessly orchestratable: tools return either a small
summary card with top_findings[≤10] and an artifact_path, or paste-ready
Markdown.
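The summary-card shape can be sketched as follows. This is an assumption-laden illustration rather than edamcp's code: the helper name build_card, the severity ordering (only "critical" and "info" are named in this README; "warning" is assumed), and the artifact file naming are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}  # assumed levels


def build_card(check_name, findings, artifact_dir, top_k=10):
    """Write all findings to disk and return a compact, severity-sorted card."""
    # Unknown severities sort last; sorted() is stable within a level.
    ranked = sorted(
        findings,
        key=lambda f: SEVERITY_ORDER.get(f["severity"], len(SEVERITY_ORDER)),
    )
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)
    artifact_path = artifact_dir / f"{check_name}.json"
    artifact_path.write_text(json.dumps(ranked, indent=2))
    return {
        "check": check_name,
        "n_findings": len(ranked),
        "top_findings": ranked[:top_k],  # only this slice reaches the agent
        "artifact_path": str(artifact_path),
    }


findings = [{"severity": "info", "title": f"note {i}"} for i in range(25)]
findings.append({"severity": "critical", "title": "target leakage: total_amount"})

card = build_card("check_leakage", findings, tempfile.mkdtemp())
print(card["top_findings"][0]["title"], len(card["top_findings"]))
```

The agent sees at most top_k findings per call, while the full list stays on disk for follow-up queries.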
Development
# clone + install
git clone https://github.com/charliecpeterson/edamcp.git
cd edamcp
uv sync
# generate the synthetic test datasets
uv run python scripts/generate_test_data.py ~/edamcp-testdata
# run the full smoke suite
EDAMCP_TEST_DATA=~/edamcp-testdata uv run python scripts/smoke_test.py
# verify local-mode tool surface
uv run python scripts/smoke_local_mode.py
The smoke tests exercise every tool end-to-end through an stdio MCP
handshake. Set EDAMCP_ANI1_PATH=/path/to/ANI-1_release/ to additionally
exercise the HDF5 scientific-array path against real data.
Status
Production-validated:
- Three months of NYC Yellow Taxi parquet (11M rows, 20 cols, 190 MB) through the full pipeline: load → audit → clean → export → modeling audit → storyboard. Completes in ~15 tool calls under 32K context with Qwen 27B in LM Studio.
- ANI-1 scientific dataset (3.5 GB HDF5, 47,934 molecule groups, ~288K leaf datasets). Full analysis in ~50 seconds; peek_array reads slices by leaf name with |S1 SMILES auto-joined to readable strings.
- Synthetic dirty e-commerce with curated quality issues — catches constants, all-nulls, sentinels, type drift, duplicates, multicollinearity, bias, PII, leakage, all in one data_card call.
Smoke coverage: 100+ assertions across 70 tools, both modes, plugins, prompts, resources, and the end-to-end cleaning pipeline.
License
MIT — see LICENSE.