S3 Data Lake MCP Server
Enables AI agents to query S3 data lakes using natural language, with support for CSV, JSON, and Parquet files and tools for data discovery, analysis, and metadata exploration.
Transform your S3 data lakes into AI-accessible knowledge bases with natural language queries
A production-ready Model Context Protocol (MCP) server that gives AI agents seamless access to S3 data lakes. Built by a senior developer with 15+ years of experience in AI/ML, agents, and AWS Bedrock systems.
Why This Exists
I was building ETL systems for AI agents and kept hitting the same wall: How do you give agents seamless access to data lakes without building custom APIs for every single use case?
Then AWS Bedrock AgentCore added MCP support, and everything clicked. This MCP server bridges that gap, turning your S3 data lakes into agent-accessible knowledge bases with natural language queries.
Key Features
- 8 Powerful Tools - Complete S3 data lake operations
- Multi-Format Support - CSV, JSON, Parquet with intelligent processing
- FastMCP Framework - Modern, high-performance MCP server
- AgentCore Runtime - Serverless, auto-scaling deployment
- Production-Grade - Comprehensive error handling, monitoring, security
- Type-Safe - Full Python type hints and validation
- Deploy in Minutes - UV package management, one-command deployment
Available Tools
| Tool | Description | Use Case |
|---|---|---|
| list_s3_buckets | List accessible S3 buckets | Data discovery |
| list_s3_objects | Browse bucket contents with filtering | Dataset exploration |
| read_csv_from_s3 | Parse CSV files with metadata | Tabular data analysis |
| read_json_from_s3 | Process JSON objects and arrays | Complex data structures |
| read_parquet_from_s3 | Columnar data with full type info | High-performance analytics |
| query_csv_data | Filter and query with smart typing | Data querying |
| get_dataset_summary | Statistical analysis and profiling | Data understanding |
| get_file_metadata | Comprehensive file information | Metadata exploration |
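To make "filter and query with smart typing" concrete, here is a minimal sketch of how a CSV filter with type inference could work. It is not the server's actual `query_csv_data` implementation: the helper names `coerce` and `query_rows`, the operator names, and the sample data are all assumptions for illustration, and it uses only the standard library rather than pandas.

```python
import csv
import io


def coerce(value):
    """Convert a CSV cell to bool, int, or float when possible; else keep the string."""
    text = (value or "").strip()
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def query_rows(csv_text, column, op, target):
    """Return typed rows whose `column` value satisfies `op` against `target`."""
    ops = {
        "eq": lambda a, b: a == b,
        "gt": lambda a, b: a > b,
        "lt": lambda a, b: a < b,
    }
    compare = ops[op]
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        typed = {key: coerce(val) for key, val in row.items()}
        try:
            if compare(typed[column], target):
                matches.append(typed)
        except TypeError:
            # Skip rows whose value cannot be ordered against the target type.
            continue
    return matches


# Hypothetical sample data, not taken from the demo datasets.
SAMPLE = """customer,industry,revenue
Acme,Technology,75000
Globex,Retail,42000.5
Initech,Technology,n/a
"""

high_value = query_rows(SAMPLE, "revenue", "gt", 50000)
tech = query_rows(SAMPLE, "industry", "eq", "Technology")
print(high_value)
print([row["customer"] for row in tech])
```

Coercing each cell before comparing means a numeric filter does not fall back to string comparison, and rows with unparseable values are skipped instead of raising an error.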
Quick Start
Prerequisites
- Python 3.12+
- UV package manager
- AWS CLI configured
- AWS Bedrock AgentCore access
1. Install & Setup
# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install dependencies
git clone https://github.com/anespo/s3-data-lake-mcp-server.git
cd s3-data-lake-mcp-server
uv sync
2. Local Development
# Run the MCP server locally
uv run python run_local.py
# Test in another terminal
uv run pytest tests/ -v
3. Deploy to AWS AgentCore Runtime
# One-command deployment
uv run python deploy_uv.py
# Your Agent ARN will be displayed for integration
4. Generate Demo Data (Optional)
# Create 66.7MB of demo datasets
uv run python generate_mock_data.py
Natural Language Queries
Once integrated with your AI agents, you can ask questions like:
Data Discovery:
- "What S3 buckets do I have access to?"
- "Show me all datasets in my analytics bucket"
- "List CSV files larger than 10MB"
Data Analysis:
- "Read the customer analytics data and show me the first 10 rows"
- "Find all sales transactions over $50,000"
- "What columns are available in the IoT sensor data?"
- "Show me customers in the Technology industry"
Metadata & Insights:
- "What's the total size of data in my bucket?"
- "How many records are in each dataset?"
- "Give me a statistical summary of the sales data"
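Behind a question like these, the agent picks a tool and sends the server an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below builds such a message for "Show me customers in the Technology industry". The envelope follows the MCP specification, but the argument names (`bucket`, `key`, `column`, `value`) and the bucket and key values are hypothetical; check the server's published tool schema for the real ones.

```python
import json
from itertools import count

# Monotonic request IDs, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name, arguments):
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


message = build_tool_call(
    "query_csv_data",
    {
        # Hypothetical argument names and values, for illustration only.
        "bucket": "my-analytics-bucket",
        "key": "customers/customer_analytics.csv",
        "column": "industry",
        "value": "Technology",
    },
)
print(message)
```

In practice an MCP client library builds and sends this message for you; the sketch only shows what crosses the wire.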
Architecture
Built on Modern Stack:
- AWS Bedrock AgentCore Runtime - Serverless, auto-scaling
- FastMCP Framework - High-performance MCP server
- UV Package Manager - Ultra-fast Python dependency management
- boto3 + pandas + pyarrow - Efficient data processing
- AWS SigV4 + IAM - Enterprise-grade security
Integration Examples
Kiro IDE
{
  "mcpServers": {
    "s3-data-lake": {
      "command": "python",
      "args": ["kiro_s3_mcp_wrapper.py"],
      "env": {
        "AWS_REGION": "eu-west-1",
        "AWS_PROFILE": "default"
      }
    }
  }
}
Strands Agents
from strands import Agent
from strands.tools.mcp import MCPClient
# Connect to deployed AgentCore Runtime
agent_arn = "arn:aws:bedrock-agentcore:eu-west-1:123456789012:runtime/s3-data-lake-mcp-server"
mcp_client = MCPClient(agent_arn)
agent = Agent(
    name="Data Lake Analyst",
    description="AI agent with S3 data lake access",
    tools=mcp_client.list_tools_sync()
)
# Natural language data analysis
response = agent("Analyze customer data and find high-value segments")
Demo Environment
The repository includes a complete demo environment with:
- 66.7MB of realistic mock data across 3 formats
- Customer Analytics (CSV, 50K records) - Business intelligence data
- Sales Transactions (JSON, 75K records) - Financial analysis data
- IoT Sensor Data (Parquet, 100K records) - Time-series analytics data
Perfect for presentations, testing, and showcasing capabilities without exposing real data.
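For a sense of how such mock data can be produced, here is a small sketch of a seeded generator that writes the same records as CSV and JSON. It is not the repository's `generate_mock_data.py`: the field names, industry values, and value ranges are assumptions, and the Parquet output (which would need pyarrow) is left out to keep the example standard-library only.

```python
import csv
import io
import json
import random

# Hypothetical category values for illustration.
INDUSTRIES = ["Technology", "Retail", "Healthcare", "Finance"]


def make_customers(n, seed=42):
    """Generate n reproducible mock customer records."""
    rng = random.Random(seed)  # fixed seed so every run yields the same data
    return [
        {
            "customer_id": i,
            "industry": rng.choice(INDUSTRIES),
            "annual_revenue": round(rng.uniform(1_000, 100_000), 2),
        }
        for i in range(1, n + 1)
    ]


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


customers = make_customers(5)
csv_text = to_csv(customers)
json_text = json.dumps(customers)
print(csv_text)
```

Seeding the generator keeps demo runs reproducible, which helps when the same datasets are used in tests and presentations.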
Testing & Quality
# Run comprehensive test suite
uv run pytest tests/ -v --cov=src
# Test specific functionality
uv run pytest tests/test_s3_mcp_server.py::test_read_csv_from_s3 -v
# Test deployed MCP server
uv run python test_deployed_mcp.py
Quality Assurance:
- 95%+ test coverage
- Type safety with mypy
- Production error handling
- Performance benchmarking
- Security validation
Documentation
| Document | Description |
|---|---|
| Deployment Guide | Complete deployment instructions |
| Architecture | System design and components |
| Integration Guide | Kiro and Strands integration |
| API Reference | Full tool documentation |
Security & Compliance
- AWS SigV4 Authentication - Industry-standard request signing
- IAM Role-Based Access - Least privilege principle
- No Hardcoded Credentials - Secure credential management
- Comprehensive Logging - Full audit trail
- Error Sanitization - No sensitive data in logs
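As an illustration of the error-sanitization idea, the sketch below redacts AWS access key IDs and ARN account IDs from a message before it is logged. This is one possible approach, not the server's actual code: the function name `sanitize_error` and the placeholder strings are assumptions, and a production version would likely cover more patterns.

```python
import re

# AWS access key IDs are 20 characters: a prefix such as AKIA (long-term)
# or ASIA (temporary) followed by 16 uppercase letters or digits.
ACCESS_KEY_PATTERN = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")

# The 12-digit account ID in the fifth field of an ARN.
ARN_ACCOUNT_PATTERN = re.compile(r"(arn:aws[\w-]*:[\w-]*:[\w-]*:)\d{12}(:)")


def sanitize_error(message):
    """Redact access key IDs and ARN account IDs from an error message."""
    redacted = ACCESS_KEY_PATTERN.sub("[REDACTED_ACCESS_KEY]", message)
    return ARN_ACCOUNT_PATTERN.sub(r"\1[REDACTED_ACCOUNT]\2", redacted)


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
raw = "AccessDenied for AKIAIOSFODNN7EXAMPLE on arn:aws:iam::123456789012:role/reader"
clean = sanitize_error(raw)
print(clean)
```

Running the sanitizer at the logging boundary, rather than at each call site, makes it harder to leak a value by forgetting one code path.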
Monitoring & Observability
- CloudWatch Integration - Centralized logging and metrics
- GenAI Observability - Specialized AI/ML monitoring
- Performance Tracking - Request latency and throughput
- Error Alerting - Proactive issue detection
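Request latency tracking can be sketched with a small decorator that times each tool call. This is a simplified, in-memory illustration under assumed names (`track_latency`, `latencies`, and the stub tool are hypothetical); a deployed server would forward these measurements to a metrics backend such as CloudWatch instead of keeping them in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store: function name -> list of call durations in seconds.
latencies = defaultdict(list)


def track_latency(func):
    """Record the wall-clock duration of every call, including failed ones."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            latencies[func.__name__].append(time.perf_counter() - start)
    return wrapper


@track_latency
def get_file_metadata_stub(key):
    """Hypothetical stand-in for a real tool; returns fixed metadata."""
    return {"key": key, "size_bytes": 0}


get_file_metadata_stub("data/a.csv")
get_file_metadata_stub("data/b.csv")

samples = latencies["get_file_metadata_stub"]
print(f"calls={len(samples)} avg_seconds={sum(samples) / len(samples):.6f}")
```

The `finally` clause ensures slow failures are measured too, which is often where latency problems show up first.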
What's Next?
Planned Enhancements:
- Multi-region deployment support
- Advanced query capabilities (SQL-like syntax)
- Real-time streaming data support
- Enhanced caching layer (Redis/ElastiCache)
- ML model integration for data insights
- Plugin architecture for custom tools
Contributing
Built by the community, for the community:
- Fork the repository
- Create a feature branch
- Add your improvements
- Add comprehensive tests
- Update documentation
- Submit a pull request
License
This project is licensed under a custom license allowing non-commercial use. See LICENSE for details.
About the Author
Built by Tony Esposito
Turning complex data infrastructure into simple, agent-accessible APIs.
Support & Community
- Documentation: Comprehensive guides in /docs
- Issues: GitHub Issues for bugs and feature requests
- Discussions: GitHub Discussions for questions
- Contact: tony@mydataclub.com
Star This Repository
If this MCP server helps your AI agents access S3 data lakes, please star the repository! It helps others discover this tool and motivates continued development.
Ready to give your AI agents superpowers with S3 data lake access? Deploy in minutes and start querying with natural language today!