S3 Data Lake MCP Server

Enables AI agents to query S3 data lakes using natural language, with support for CSV, JSON, and Parquet files and tools for data discovery, analysis, and metadata exploration.

Transform your S3 data lakes into AI-accessible knowledge bases with natural language queries

A production-ready Model Context Protocol (MCP) server that gives AI agents seamless access to S3 data lakes. Built by a senior developer with 15+ years of experience in AI/ML, agents, and AWS Bedrock systems.

🎯 Why This Exists

I was building ETL systems for AI agents and kept hitting the same wall: How do you give agents seamless access to data lakes without building custom APIs for every single use case?

Then AWS Bedrock AgentCore added MCP support, and everything clicked. This MCP server bridges that gap, turning your S3 data lakes into agent-accessible knowledge bases that answer natural language queries.

✨ Key Features

🔥 8 Powerful Tools - Complete S3 data lake operations
📊 Multi-Format Support - CSV, JSON, Parquet with intelligent processing
⚡ FastMCP Framework - Modern, high-performance MCP server
🏗️ AgentCore Runtime - Serverless, auto-scaling deployment
🛡️ Production-Grade - Comprehensive error handling, monitoring, security
🎯 Type-Safe - Full Python type hints and validation
🚀 Deploy in Minutes - UV package management, one-command deployment

🛠️ Available Tools

| Tool | Description | Use Case |
| --- | --- | --- |
| list_s3_buckets | List accessible S3 buckets | Data discovery |
| list_s3_objects | Browse bucket contents with filtering | Dataset exploration |
| read_csv_from_s3 | Parse CSV files with metadata | Tabular data analysis |
| read_json_from_s3 | Process JSON objects and arrays | Complex data structures |
| read_parquet_from_s3 | Columnar data with full type info | High-performance analytics |
| query_csv_data | Filter and query with smart typing | Data querying |
| get_dataset_summary | Statistical analysis and profiling | Data understanding |
| get_file_metadata | Comprehensive file information | Metadata exploration |
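
To make "filter and query with smart typing" concrete, here is a minimal sketch of how a query_csv_data-style filter could work. It is not the server's actual implementation: the names coerce and query_rows, the operator set, and the sample columns are all assumptions made for illustration, and it runs on an in-memory string rather than an S3 object.

```python
import csv
import io
import operator

# Hypothetical operator table; the real tool may accept a different syntax.
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def coerce(value):
    """Try int, then float; otherwise return the stripped string."""
    text = (value or "").strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def query_rows(csv_text, column, op, target):
    """Return typed rows where `row[column] <op> target` holds."""
    compare = OPS[op]
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        typed = {key: coerce(val) for key, val in row.items()}
        try:
            if compare(typed[column], target):
                matches.append(typed)
        except TypeError:
            # Mixed types (for example "n/a" vs 50000) cannot be ordered; skip.
            continue
    return matches


# Made-up sample data standing in for a CSV object read from S3.
SAMPLE = (
    "customer,industry,revenue\n"
    "Acme,Technology,72000\n"
    "Bolt,Retail,18000.5\n"
    "Core,Technology,n/a\n"
)

high_value = query_rows(SAMPLE, "revenue", ">", 50000)
```

Coercing each cell before comparing is what lets a numeric threshold work against text that came out of a CSV file; rows whose value cannot be compared are skipped rather than failing the whole query.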

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • UV package manager
  • AWS CLI configured
  • AWS Bedrock AgentCore access

1. Install & Setup

# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install dependencies
git clone https://github.com/anespo/s3-data-lake-mcp-server.git
cd s3-data-lake-mcp-server
uv sync

2. Local Development

# Run the MCP server locally
uv run python run_local.py

# Test in another terminal
uv run pytest tests/ -v

3. Deploy to AWS AgentCore Runtime

# One-command deployment
uv run python deploy_uv.py

# Your Agent ARN will be displayed for integration

4. Generate Demo Data (Optional)

# Create 66.7MB of demo datasets
uv run python generate_mock_data.py

💬 Natural Language Queries

Once integrated with your AI agents, you can ask questions like:

Data Discovery:

  • "What S3 buckets do I have access to?"
  • "Show me all datasets in my analytics bucket"
  • "List CSV files larger than 10MB"
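
A request like "List CSV files larger than 10MB" ultimately comes down to filtering an object listing by key suffix and size. The sketch below assumes listing entries shaped like the `Key` and `Size` fields that S3 returns for each object; the function name filter_objects, the sample keys, and the choice of 1 MB = 1024 * 1024 bytes are assumptions, not the server's documented behavior.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def filter_objects(objects, suffix="", min_bytes=0):
    """Keep entries whose key ends with `suffix` and whose size exceeds `min_bytes`."""
    wanted = suffix.lower()
    return [
        obj for obj in objects
        if obj["Key"].lower().endswith(wanted) and obj["Size"] > min_bytes
    ]


# Made-up listing standing in for the contents of a bucket.
listing = [
    {"Key": "analytics/customers.csv", "Size": 25 * MB},
    {"Key": "analytics/notes.csv", "Size": 2 * MB},
    {"Key": "iot/sensors.parquet", "Size": 40 * MB},
]

large_csv = filter_objects(listing, suffix=".csv", min_bytes=10 * MB)
```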

Data Analysis:

  • "Read the customer analytics data and show me the first 10 rows"
  • "Find all sales transactions over $50,000"
  • "What columns are available in the IoT sensor data?"
  • "Show me customers in the Technology industry"
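
For JSON datasets, a question such as "Find all sales transactions over $50,000" needs the payload normalized into records first, since a file may hold a single object or an array. This is a rough sketch under that assumption; load_records and the `amount` field are hypothetical names, not part of the server's API.

```python
import json


def load_records(raw):
    """Return a list of records whether the payload is one object or an array."""
    data = json.loads(raw)
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list):
        return [item for item in data if isinstance(item, dict)]
    raise ValueError("expected a JSON object or an array of objects")


# Made-up payload standing in for a sales transactions file.
payload = '[{"id": 1, "amount": 72000.0}, {"id": 2, "amount": 1250.5}, {"id": 3, "amount": 50000}]'

big_sales = [rec for rec in load_records(payload) if rec.get("amount", 0) > 50_000]
```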

Metadata & Insights:

  • "What's the total size of data in my bucket?"
  • "How many records are in each dataset?"
  • "Give me a statistical summary of the sales data"
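
A "statistical summary" request can be pictured as per-column profiling. The sketch below is one possible shape for such a summary, using only the standard library; the function name summarize, the output keys, and the sample data are assumptions, and the real get_dataset_summary tool may report different statistics.

```python
import csv
import io
import statistics


def summarize(csv_text):
    """Profile each column: numeric columns get min/max/mean, others a distinct count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows if row[name] not in (None, "")]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = []  # at least one value is not numeric; treat as text
        if numbers:
            summary["columns"][name] = {
                "type": "numeric",
                "non_null": len(numbers),
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.fmean(numbers),
            }
        else:
            summary["columns"][name] = {
                "type": "text",
                "non_null": len(values),
                "distinct": len(set(values)),
            }
    return summary


# Made-up sample with one missing amount.
SALES = "region,amount\nEU,100\nUS,300\nEU,\n"
report = summarize(SALES)
```

Blank cells are excluded before profiling, so a column with gaps still counts as numeric and its `non_null` value shows how many cells were actually populated.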

🏗️ Architecture

[Architecture diagram]

Built on Modern Stack:

  • 🏗️ AWS Bedrock AgentCore Runtime - Serverless, auto-scaling
  • ⚡ FastMCP Framework - High-performance MCP server
  • 📦 UV Package Manager - Ultra-fast Python dependency management
  • 🔧 boto3 + pandas + pyarrow - Efficient data processing
  • 🛡️ AWS SigV4 + IAM - Enterprise-grade security
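
For readers unfamiliar with SigV4, the sketch below shows the signing-key derivation step of the algorithm using only the standard library. It is purely illustrative: boto3 signs requests automatically, so nothing like this needs to be written when using this server, and the secret, date, region, and service values here are placeholders.

```python
import hashlib
import hmac


def _hmac_sha256(key, message):
    """HMAC-SHA256 of a text message under a bytes key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Chain four HMACs: date, region, service, then the literal 'aws4_request'."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder inputs; never hardcode a real secret key.
signing_key = derive_signing_key("EXAMPLE-SECRET-NOT-REAL", "20250101", "eu-west-1", "s3")
```

Because the key is scoped to a date, region, and service, a signature computed with it is only valid for that scope.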

🔗 Integration Examples

Kiro IDE

{
  "mcpServers": {
    "s3-data-lake": {
      "command": "python",
      "args": ["kiro_s3_mcp_wrapper.py"],
      "env": {
        "AWS_REGION": "eu-west-1",
        "AWS_PROFILE": "default"
      }
    }
  }
}

Strands Agents

from strands import Agent
from strands.tools.mcp import MCPClient

# Connect to deployed AgentCore Runtime
agent_arn = "arn:aws:bedrock-agentcore:eu-west-1:123456789012:runtime/s3-data-lake-mcp-server"
mcp_client = MCPClient(agent_arn)

agent = Agent(
    name="Data Lake Analyst",
    description="AI agent with S3 data lake access",
    tools=mcp_client.list_tools_sync()
)

# Natural language data analysis
response = agent("Analyze customer data and find high-value segments")

📊 Demo Environment

The repository includes a complete demo environment with:

  • 66.7MB of realistic mock data across 3 formats
  • Customer Analytics (CSV, 50K records) - Business intelligence data
  • Sales Transactions (JSON, 75K records) - Financial analysis data
  • IoT Sensor Data (Parquet, 100K records) - Time-series analytics data

Perfect for presentations, testing, and showcasing capabilities without exposing real data.
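
As a rough idea of how reproducible mock data can be produced, here is a small sketch that writes seeded random customer rows as CSV. It is not the repository's generate_mock_data.py: the column names, industry values, and revenue range are assumptions chosen for illustration.

```python
import csv
import io
import random

# Assumed category values; the real demo data may use others.
INDUSTRIES = ["Technology", "Retail", "Healthcare", "Finance"]


def generate_customers(count, seed=42):
    """Return `count` mock customer rows as CSV text, reproducible for a given seed."""
    rng = random.Random(seed)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer_id", "industry", "annual_revenue"])
    for index in range(1, count + 1):
        writer.writerow([
            f"CUST-{index:05d}",
            rng.choice(INDUSTRIES),
            round(rng.uniform(1_000, 250_000), 2),
        ])
    return buffer.getvalue()


csv_text = generate_customers(5)
```

Seeding the generator keeps the demo dataset identical across runs, which makes test assertions and presentations repeatable.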

🧪 Testing & Quality

# Run comprehensive test suite
uv run pytest tests/ -v --cov=src

# Test specific functionality
uv run pytest tests/test_s3_mcp_server.py::test_read_csv_from_s3 -v

# Test deployed MCP server
uv run python test_deployed_mcp.py

Quality Assurance:

  • ✅ 95%+ test coverage
  • ✅ Type safety with mypy
  • ✅ Production error handling
  • ✅ Performance benchmarking
  • ✅ Security validation

📚 Documentation

| Document | Description |
| --- | --- |
| 🚀 Deployment Guide | Complete deployment instructions |
| 🏗️ Architecture | System design and components |
| 🔗 Integration Guide | Kiro and Strands integration |
| 📋 API Reference | Full tool documentation |

🛡️ Security & Compliance

  • 🔐 AWS SigV4 Authentication - Industry-standard request signing
  • 🎯 IAM Role-Based Access - Least privilege principle
  • 🔒 No Hardcoded Credentials - Secure credential management
  • 📊 Comprehensive Logging - Full audit trail
  • 🛡️ Error Sanitization - No sensitive data in logs
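
Error sanitization can be as simple as redacting credential-shaped strings before a message is logged. The sketch below is one way to do that, not the server's actual sanitizer: the function name sanitize, the two patterns, and the replacement markers are assumptions. The first pattern targets AWS access key IDs, which are 20 characters starting with AKIA or ASIA.

```python
import re

# 20-character access key IDs: AKIA (long-term) or ASIA (temporary) prefix.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")
# Anything assigned to a name starting with "secret", e.g. aws_secret_access_key=...
SECRET_ASSIGNMENT = re.compile(r"(secret[_a-z]*\s*[=:]\s*)\S+", re.IGNORECASE)


def sanitize(message):
    """Redact access key IDs and secret assignments from a log message."""
    message = ACCESS_KEY_ID.sub("[REDACTED_KEY_ID]", message)
    return SECRET_ASSIGNMENT.sub(r"\1[REDACTED]", message)


# AWS's documented example key ID, used here as harmless test input.
cleaned = sanitize("Access denied for AKIAIOSFODNN7EXAMPLE on bucket demo")
```

Pattern-based redaction is a backstop, not a guarantee; it only catches secrets that match a known shape.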

📈 Monitoring & Observability

  • 📊 CloudWatch Integration - Centralized logging and metrics
  • 🎯 GenAI Observability - Specialized AI/ML monitoring
  • ⚡ Performance Tracking - Request latency and throughput
  • 🚨 Error Alerting - Proactive issue detection
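
To illustrate what per-request latency tracking involves, here is a minimal in-memory sketch using a decorator. It is only an assumption about the general approach: the names track_latency, LATENCIES_MS, and fake_tool are hypothetical, and a deployed server would publish these measurements to CloudWatch rather than keep them in a dictionary.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store: tool name -> list of latencies in milliseconds.
LATENCIES_MS = defaultdict(list)


def track_latency(func):
    """Record the wall-clock duration of every call, including calls that raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            LATENCIES_MS[func.__name__].append(elapsed_ms)
    return wrapper


@track_latency
def fake_tool(n):
    """Stand-in for an MCP tool handler."""
    return sum(range(n))


fake_tool(1000)
fake_tool(10)
```

Recording in a `finally` block means failed requests are timed too, which matters when slow failures are the thing being investigated.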

🚀 What's Next?

Planned Enhancements:

  • 🌍 Multi-region deployment support
  • 🔍 Advanced query capabilities (SQL-like syntax)
  • 📡 Real-time streaming data support
  • 🚀 Enhanced caching layer (Redis/ElastiCache)
  • 🤖 ML model integration for data insights
  • 🔌 Plugin architecture for custom tools

🤝 Contributing

Built by the community, for the community:

  1. 🍴 Fork the repository
  2. 🌟 Create a feature branch
  3. ✨ Add your improvements
  4. 🧪 Add comprehensive tests
  5. 📝 Update documentation
  6. 🚀 Submit a pull request

📄 License

This project is licensed under a custom license allowing non-commercial use. See LICENSE for details.

👨‍💻 About the Author

Built by Tony Esposito

Turning complex data infrastructure into simple, agent-accessible APIs.

🙋‍♂️ Support & Community

  • 📖 Documentation: Comprehensive guides in /docs
  • 🐛 Issues: GitHub Issues for bugs and feature requests
  • 💬 Discussions: GitHub Discussions for questions
  • 📧 Contact: tony@mydataclub.com

⭐ Star This Repository

If this MCP server helps your AI agents access S3 data lakes, please star the repository! It helps others discover this tool and motivates continued development.


🚀 Ready to give your AI agents superpowers with S3 data lake access? Deploy in minutes and start querying with natural language today!