
DataBeak

Tests · codecov · Python 3.12+ · License · Code style: ruff

AI-Powered CSV Processing via Model Context Protocol

Transform how AI assistants work with CSV data. DataBeak provides 40+ specialized tools for data manipulation, analysis, and validation through the Model Context Protocol (MCP).

Features

  • 🔄 Complete Data Operations - Load, transform, and analyze CSV data from URLs and string content
  • 📊 Advanced Analytics - Statistics, correlations, outlier detection, data profiling
  • ✅ Data Validation - Schema validation, quality scoring, anomaly detection
  • 🎯 Stateless Design - Clean MCP architecture with external context management
  • ⚡ High Performance - Async I/O, streaming downloads, chunked processing
  • 🔒 Session Management - Multi-user support with isolated sessions
  • 🛡️ Web-Safe - No file system access; designed for secure web hosting
  • 🌟 Code Quality - Zero ruff violations, 100% mypy compliance, perfect MCP documentation standards, comprehensive test coverage

Getting Started

The fastest way to use DataBeak is with uvx (no installation required):

For Claude Desktop

Add this to your MCP Settings file:

{
  "mcpServers": {
    "databeak": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/jonpspri/databeak.git",
        "databeak"
      ]
    }
  }
}
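If the server does not appear in Claude Desktop, a malformed settings file is a common cause. A quick way to sanity-check the snippet before pasting it (a generic JSON check, not specific to DataBeak):

```python
import json

# The MCP server entry shown above, held as a string for validation
snippet = """
{
  "mcpServers": {
    "databeak": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/jonpspri/databeak.git",
        "databeak"
      ]
    }
  }
}
"""

config = json.loads(snippet)  # raises ValueError on malformed JSON
server = config["mcpServers"]["databeak"]
print("command:", server["command"], "| args:", len(server["args"]))
```

json.loads fails loudly on a stray comma or missing brace, which is easier to debug than a silently ignored server entry.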

For Other AI Clients

DataBeak works with Continue, Cline, Windsurf, and Zed. See the installation guide for specific configuration examples.

HTTP Mode (Advanced)

For HTTP-based AI clients or custom deployments:

# Run in HTTP mode
uv run databeak --transport http --host 0.0.0.0 --port 8000

# Access server at http://localhost:8000/mcp
# Health check at http://localhost:8000/health

Quick Test

Once configured, ask your AI assistant:

"Load this CSV data: name,price\nWidget,10.99\nGadget,25.50"
"Load CSV from URL: https://example.com/data.csv"
"Remove duplicate rows and show me the statistics"
"Find outliers in the price column"
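Under the hood, requests like these reduce to ordinary tabular operations. As a rough illustration only (a plain-Python sketch, not DataBeak's actual implementation), duplicate removal and z-score outlier flagging over sample data might look like:

```python
import csv
import io
import statistics

raw = "name,price\nWidget,10.99\nGadget,25.50\nWidget,10.99\nDoohickey,199.00"
rows = list(csv.DictReader(io.StringIO(raw)))

# Remove duplicate rows while preserving first-seen order
seen, unique = set(), []
for row in rows:
    key = tuple(row.items())
    if key not in seen:
        seen.add(key)
        unique.append(row)

# Flag outliers in the price column with a simple z-score threshold
prices = [float(r["price"]) for r in unique]
mean, stdev = statistics.mean(prices), statistics.stdev(prices)
outliers = [r["name"] for r in unique if abs(float(r["price"]) - mean) / stdev > 1.0]

print(f"{len(rows) - len(unique)} duplicate(s) removed; outliers: {outliers}")
```

DataBeak exposes these operations as MCP tools so the assistant performs them on your behalf; the sketch just shows the kind of work involved.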

Documentation

📚 Complete Documentation

Environment Variables

Configure DataBeak behavior with environment variables (all use DATABEAK_ prefix):

Variable                              Default      Description
DATABEAK_SESSION_TIMEOUT              3600         Session timeout (seconds)
DATABEAK_MAX_DOWNLOAD_SIZE_MB         100          Maximum URL download size (MB)
DATABEAK_MAX_MEMORY_USAGE_MB          1000         Maximum DataFrame memory (MB)
DATABEAK_MAX_ROWS                     1,000,000    Maximum DataFrame rows
DATABEAK_URL_TIMEOUT_SECONDS          30           URL download timeout (seconds)
DATABEAK_HEALTH_MEMORY_THRESHOLD_MB   2048         Health monitoring memory threshold (MB)

See settings.py for complete configuration options.
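Settings like these follow the common pattern of prefixed environment variables with typed defaults. A minimal sketch of how such a configuration can be read (illustrative only; DataBeak's real settings.py has its own mechanism, and only a subset of variables is shown):

```python
import os
from dataclasses import dataclass

PREFIX = "DATABEAK_"

@dataclass
class Settings:
    session_timeout: int = 3600        # DATABEAK_SESSION_TIMEOUT
    max_download_size_mb: int = 100    # DATABEAK_MAX_DOWNLOAD_SIZE_MB
    max_rows: int = 1_000_000          # DATABEAK_MAX_ROWS

def load_settings() -> Settings:
    """Override each default from the environment when the variable is set."""
    kwargs = {}
    for field, default in vars(Settings()).items():
        value = os.environ.get(PREFIX + field.upper())
        if value is not None:
            kwargs[field] = type(default)(value)  # coerce string to the field type
    return Settings(**kwargs)

os.environ["DATABEAK_MAX_ROWS"] = "500000"
print(load_settings())
```

Unset variables keep their defaults, so a deployment only needs to export the values it wants to change.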

Known Limitations

DataBeak is designed for interactive CSV processing with AI assistants. Be aware of these constraints:

  • Data Loading: URLs and string content only (no local file system access for web hosting security)
  • Download Size: Maximum 100MB per URL download (configurable via DATABEAK_MAX_DOWNLOAD_SIZE_MB)
  • DataFrame Size: Maximum 1GB memory and 1M rows per DataFrame (configurable)
  • Session Management: Maximum 100 concurrent sessions, 1-hour timeout (configurable)
  • Memory: Large datasets may require significant memory; monitor with health_check tool
  • CSV Dialects: Assumes standard CSV format; complex dialects may require pre-processing
  • Concurrency: Async I/O for concurrent URL downloads; parallel sessions supported
  • Data Types: Automatic type inference; complex types may need explicit conversion
  • URL Loading: HTTPS only; blocks private networks (127.0.0.1, 192.168.x.x, 10.x.x.x) for security

For production deployments with larger datasets, adjust environment variables and monitor resource usage with health_check and get_server_info tools.
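The private-network blocking noted above is a standard SSRF defence. Python's ipaddress module makes the core check straightforward (a sketch of the general technique, not DataBeak's exact implementation):

```python
import ipaddress
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    """Reject non-HTTPS URLs and literal private/loopback/link-local IP hosts."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    try:
        addr = ipaddress.ip_address(parsed.hostname or "")
    except ValueError:
        # Hostname rather than a literal IP; a full implementation would
        # also resolve DNS and re-check the resulting addresses.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)

print(is_url_allowed("https://example.com/data.csv"))   # True
print(is_url_allowed("https://192.168.1.5/data.csv"))   # False
print(is_url_allowed("http://example.com/data.csv"))    # False
```

Note the DNS caveat in the comment: blocking only literal IPs is not sufficient on its own, since a hostname can resolve to a private address.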

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run quality checks: uv run -m pytest, uv run ruff check, and uv run mypy src/databeak/
  5. Submit a pull request

Note: All changes must go through pull requests. Direct commits to main are blocked by pre-commit hooks.

Development

# Setup development environment
git clone https://github.com/jonpspri/databeak.git
cd databeak
uv sync

# Run the server locally
uv run databeak

# Run tests
uv run -m pytest tests/unit/          # Unit tests (primary)
uv run -m pytest                      # All tests

# Run quality checks
uv run ruff check
uv run mypy src/databeak/

Testing Structure

DataBeak implements comprehensive unit and integration testing:

  • Unit Tests (tests/unit/) - 940+ fast, isolated module tests
  • Integration Tests (tests/integration/) - 43 FastMCP Client-based protocol tests across 7 test files
  • E2E Tests (tests/e2e/) - Planned: Complete workflow validation

Test Execution:

uv run pytest -n auto tests/unit/          # Run unit tests (940+ tests)
uv run pytest -n auto tests/integration/   # Run integration tests (43 tests)
uv run pytest -n auto --cov=src/databeak   # Run with coverage analysis

See Testing Guide for comprehensive testing details.

License

Apache 2.0 - see LICENSE file.
