EMR MCP Server
A comprehensive Model Context Protocol (MCP) server that provides intelligent guidance for EMR cluster management, configuration recommendations, and monitoring capabilities. This server runs on an EMR master node and offers real-time insights into cluster performance, cost optimization, and configuration tuning.
Features
Cluster Management
- Real-time cluster information with detailed instance group analysis
- Multi-cluster support with filtering and search capabilities
- Cost analysis and estimation with breakdown by instance types
- Instance type recommendations based on workload patterns
- Auto-scaling policy suggestions for optimal resource utilization
Resource Monitoring
- YARN ResourceManager integration for application monitoring
- HDFS NameNode monitoring for storage health and utilization
- Real-time resource utilization across all cluster nodes
- Application performance analysis with bottleneck identification
- Historical trend analysis for capacity planning
Analytics & Optimization
- Spark History Server integration for detailed job analysis
- Configuration recommendations based on workload patterns
- Performance diagnostics with actionable insights
- Cost optimization suggestions including spot instance usage
- Workload-specific tuning for batch, streaming, and ML workloads
Security & Authentication
- Multiple authentication methods: API keys, JWT tokens, IAM roles
- Role-based access control with granular permissions
- Secure communication with HTTPS and certificate validation
- Request rate limiting to prevent abuse
- Audit logging for compliance and monitoring
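As an illustration of the API-key method listed above, here is a minimal sketch of a key check. The function name `is_valid_api_key` is hypothetical and is not taken from this project's `auth.py`; the sketch only shows one reasonable way to compare keys without leaking timing information.

```python
import hmac
from typing import Iterable


def is_valid_api_key(presented: str, allowed_keys: Iterable[str]) -> bool:
    """Return True if the presented key equals any configured key.

    Hypothetical helper: the server's real check may differ.
    """
    presented_bytes = presented.encode("utf-8")
    matched = False
    for key in allowed_keys:
        # compare_digest takes time independent of where the inputs differ,
        # so a caller cannot guess a key one character at a time.
        if hmac.compare_digest(presented_bytes, key.encode("utf-8")):
            matched = True
    return matched
```

The loop deliberately visits every configured key instead of returning early, so the response time does not reveal which position in the list matched.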
Quick Start
Prerequisites
- EMR cluster running version 6.0+
- Python 3.8+
- Access to YARN ResourceManager (port 8088)
- Access to Spark History Server (port 18080)
- Access to HDFS NameNode (port 9870)
Installation
# Clone the repository
git clone https://github.com/your-org/emr-mcp-server.git
cd emr-mcp-server
# Install dependencies
pip install -r requirements.txt
# Configure the server
cp config/server_config.yaml.example config/server_config.yaml
# Edit the configuration file with your EMR cluster details
Configuration
Edit config/server_config.yaml:
server:
  host: "0.0.0.0"
  port: 3000
  debug: false
  workers: 4

emr:
  region: "us-east-1"
  cluster_id: "j-XXXXXXXXX"  # Optional: specific cluster ID

yarn:
  resource_manager_url: "http://localhost:8088"
  timeout: 30

spark:
  history_server_url: "http://localhost:18080"
  timeout: 30

hdfs:
  namenode_url: "http://localhost:9870"
  timeout: 30

auth:
  method: "api_key"  # Options: api_key, jwt, iam
  api_keys:
    - "emr-mcp-default-key"
  jwt_secret: "your-jwt-secret"

logging:
  level: "INFO"
  format: "console"  # Options: console, json
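Before starting the server it can help to sanity-check the values above. The following is a minimal sketch that works on the configuration after it has been parsed into a plain dictionary (for example by a YAML parser). The function `validate_config` is hypothetical and is not part of this project; the allowed values are the ones listed in the comments of the sample file.

```python
from typing import Any, Dict, List

AUTH_METHODS = {"api_key", "jwt", "iam"}   # options from the sample config
LOG_FORMATS = {"console", "json"}          # options from the sample config


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems: List[str] = []

    port = config.get("server", {}).get("port")
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"server.port must be an integer in 1..65535, got {port!r}")

    auth = config.get("auth", {})
    method = auth.get("method")
    if method not in AUTH_METHODS:
        problems.append(f"auth.method must be one of {sorted(AUTH_METHODS)}, got {method!r}")
    elif method == "api_key" and not auth.get("api_keys"):
        problems.append("auth.api_keys needs at least one key when auth.method is api_key")
    elif method == "jwt" and not auth.get("jwt_secret"):
        problems.append("auth.jwt_secret is required when auth.method is jwt")

    log_format = config.get("logging", {}).get("format")
    if log_format not in LOG_FORMATS:
        problems.append(f"logging.format must be one of {sorted(LOG_FORMATS)}, got {log_format!r}")

    return problems
```

Returning a list of messages instead of raising on the first error lets an operator fix every problem in one pass.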
Running the Server
# Start the server directly
python -m src.server
# Or use the startup script
./scripts/start_server.sh
# Check server status
curl http://localhost:3000/health
MCP Tools
Cluster Management Tools
get_cluster_info
Retrieve comprehensive EMR cluster information including configuration, instance groups, and cost analysis.
{
  "name": "get_cluster_info",
  "arguments": {
    "cluster_id": "j-XXXXXXXXX"  // Optional
  }
}
list_clusters
List all EMR clusters with optional state filtering.
{
  "name": "list_clusters",
  "arguments": {
    "states": ["RUNNING", "WAITING"]  // Optional
  }
}
estimate_cost
Calculate current and projected costs with detailed breakdown.
{
  "name": "estimate_cost",
  "arguments": {
    "runtime_hours": 48.0,       // Optional
    "cluster_id": "j-XXXXXXXXX"  // Optional
  }
}
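The arithmetic behind a per-group cost breakdown is simple: hourly rate times instance count times runtime, summed over the instance groups. The sketch below illustrates that calculation only. It is not the server's implementation, the names `InstanceGroup` and `estimate_cost_breakdown` are hypothetical, and the hourly rates in the usage example are placeholders, not real AWS prices.

```python
from typing import Dict, Iterable, NamedTuple


class InstanceGroup(NamedTuple):
    name: str               # e.g. "MASTER", "CORE", "TASK"
    instance_type: str      # e.g. "m5.xlarge"
    count: int
    hourly_rate_usd: float  # caller-supplied rate; look up current pricing yourself


def estimate_cost_breakdown(groups: Iterable[InstanceGroup],
                            runtime_hours: float) -> Dict[str, float]:
    """Return cost per instance group plus a TOTAL entry."""
    breakdown = {g.name: g.count * g.hourly_rate_usd * runtime_hours for g in groups}
    # The right-hand side is evaluated before TOTAL is inserted,
    # so the sum covers only the per-group entries.
    breakdown["TOTAL"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Placeholder rates chosen for easy arithmetic, not actual prices.
    groups = [
        InstanceGroup("MASTER", "m5.xlarge", 1, 0.25),
        InstanceGroup("CORE", "m5.2xlarge", 3, 0.5),
        InstanceGroup("TASK", "m5.large", 2, 0.125),
    ]
    print(estimate_cost_breakdown(groups, runtime_hours=48.0))
```

The rate you pass in should be whatever combined hourly figure applies to your cluster, since EMR charges are billed on top of the underlying EC2 price.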
suggest_instance_types
Get AI-powered instance type recommendations based on workload characteristics.
{
  "name": "suggest_instance_types",
  "arguments": {
    "workload_type": "memory_intensive",  // Options: general, compute_intensive, memory_intensive, storage_intensive
    "data_size_gb": 1000,                 // Optional
    "concurrent_jobs": 10                 // Optional
  }
}
Monitoring Tools
monitor_resources
Get real-time resource utilization across YARN, HDFS, and cluster nodes.
{
  "name": "monitor_resources",
  "arguments": {}
}
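For context on where utilization figures like these can come from, the YARN ResourceManager exposes cluster-wide metrics at `/ws/v1/cluster/metrics`. The sketch below turns such a response into utilization percentages. It is an illustration under that assumption, not the server's own code, and the function name `summarize_yarn_metrics` is hypothetical.

```python
from typing import Any, Dict


def summarize_yarn_metrics(payload: Dict[str, Any]) -> Dict[str, float]:
    """Reduce a YARN cluster-metrics response to a few utilization numbers."""
    metrics = payload["clusterMetrics"]

    def percent(used: float, total: float) -> float:
        # Guard against an empty cluster reporting zero capacity.
        return round(100.0 * used / total, 1) if total else 0.0

    return {
        "memory_used_pct": percent(metrics["allocatedMB"], metrics["totalMB"]),
        "vcores_used_pct": percent(metrics["allocatedVirtualCores"],
                                   metrics["totalVirtualCores"]),
        "active_nodes": metrics["activeNodes"],
    }
```

Passing a parsed response such as `{"clusterMetrics": {"allocatedMB": 4096, "totalMB": 16384, ...}}` yields a memory utilization of 25.0 percent.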
analyze_yarn_applications
Analyze YARN applications with performance metrics and resource usage.
{
  "name": "analyze_yarn_applications",
  "arguments": {
    "states": ["RUNNING", "FINISHED"],  // Optional
    "application_types": ["SPARK"],     // Optional
    "limit": 50                         // Optional, default: 50
  }
}
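The three arguments above line up with query parameters of the YARN ResourceManager applications endpoint (`states`, `applicationTypes`, `limit`). Assuming the tool queries that endpoint, the mapping might look like the sketch below. The function name `build_yarn_apps_url` is hypothetical, and the project's actual connector may build its requests differently.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urlencode


def build_yarn_apps_url(base_url: str,
                        states: Optional[Iterable[str]] = None,
                        application_types: Optional[Iterable[str]] = None,
                        limit: int = 50) -> str:
    """Build a ResourceManager /ws/v1/cluster/apps URL from tool arguments."""
    params: Dict[str, str] = {}
    if states:
        params["states"] = ",".join(states)
    if application_types:
        params["applicationTypes"] = ",".join(application_types)
    params["limit"] = str(limit)
    # safe="," keeps the comma-separated lists readable instead of %2C.
    query = urlencode(params, safe=",")
    return f"{base_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


if __name__ == "__main__":
    print(build_yarn_apps_url("http://localhost:8088",
                              states=["RUNNING", "FINISHED"],
                              application_types=["SPARK"]))
```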
diagnose_performance
Identify performance bottlenecks and get optimization recommendations.
{
  "name": "diagnose_performance",
  "arguments": {
    "app_id": "application_1234567890_0001",  // Optional
    "time_range_hours": 24                    // Optional, default: 24
  }
}
Analytics Tools
get_spark_logs
Fetch and analyze Spark application logs for debugging and optimization.
{
  "name": "get_spark_logs",
  "arguments": {
    "app_id": "application_1234567890_0001",  // Required
    "executor_id": "1"                        // Optional
  }
}
recommend_configuration
Get workload-specific configuration recommendations for Spark and YARN.
{
  "name": "recommend_configuration",
  "arguments": {
    "workload_type": "batch",                 // Options: batch, streaming, ml, interactive
    "app_id": "application_1234567890_0001"   // Optional
  }
}
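The `//` annotations in the tool examples are documentation only; strict JSON parsers reject comments, so real payloads must omit them. A small client-side helper can produce clean payloads and catch a misspelled `workload_type` before the request is sent. The sketch below is one way to do that. The name `build_tool_call` is hypothetical, and the allowed values are copied from the option lists documented for each tool.

```python
import json
from typing import Any, Dict, Optional

# Allowed workload_type values, as documented for each tool.
WORKLOAD_TYPES = {
    "suggest_instance_types": {"general", "compute_intensive",
                               "memory_intensive", "storage_intensive"},
    "recommend_configuration": {"batch", "streaming", "ml", "interactive"},
}


def build_tool_call(name: str, arguments: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a tool call to strict JSON, dropping unset optional arguments."""
    cleaned = {k: v for k, v in (arguments or {}).items() if v is not None}

    allowed = WORKLOAD_TYPES.get(name)
    workload_type = cleaned.get("workload_type")
    if allowed is not None and workload_type is not None and workload_type not in allowed:
        raise ValueError(
            f"{name}: workload_type must be one of {sorted(allowed)}, got {workload_type!r}"
        )

    return json.dumps({"name": name, "arguments": cleaned})
```

Optional arguments can then be passed as `None` and are simply left out of the payload.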
Deployment Options
1. EMR Bootstrap Script (Recommended)
Deploy automatically when creating an EMR cluster:
# Upload bootstrap script to S3
aws s3 cp scripts/bootstrap-emr-mcp.sh s3://your-bucket/
# Create EMR cluster with MCP server
aws emr create-cluster \
--name "EMR-MCP-Cluster" \
--release-label emr-6.4.0 \
--applications Name=Spark Name=Hadoop Name=Hive Name=Zeppelin \
--instance-groups \
InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m5.2xlarge,InstanceCount=3 \
InstanceGroupType=TASK,InstanceType=m5.large,InstanceCount=2,BidPrice=0.05 \
--bootstrap-actions Path=s3://your-bucket/bootstrap-emr-mcp.sh \
--ec2-attributes KeyName=your-key-pair \
--log-uri s3://your-bucket/emr-logs/
2. Docker Deployment
# Build the image
docker build -t emr-mcp-server .
# Run with docker-compose
docker-compose up -d
# Check logs
docker-compose logs -f emr-mcp-server
3. Systemd Service
# Copy service file
sudo cp scripts/emr-mcp-server.service /etc/systemd/system/
# Enable and start
sudo systemctl enable emr-mcp-server
sudo systemctl start emr-mcp-server
sudo systemctl status emr-mcp-server
Usage Examples
Python Client
import asyncio

from examples.client_example import EMRMCPClient


async def main():
    async with EMRMCPClient("http://localhost:3000", "emr-mcp-default-key") as client:
        # Get cluster information
        cluster_info = await client.call_tool("get_cluster_info")
        print("Cluster Info:", cluster_info["content"][0]["text"])

        # Monitor resources
        resources = await client.call_tool("monitor_resources")
        print("Resources:", resources["content"][0]["text"])

        # Get configuration recommendations
        config_rec = await client.call_tool("recommend_configuration", {
            "workload_type": "batch"
        })
        print("Config Recommendations:", config_rec["content"][0]["text"])


asyncio.run(main())
cURL Examples
# Health check
curl http://localhost:3000/health
# List available tools
curl -X GET http://localhost:3000/tools \
-H "X-API-Key: emr-mcp-default-key"
# Get cluster information
curl -X POST http://localhost:3000/tools/call \
-H "Content-Type: application/json" \
-H "X-API-Key: emr-mcp-default-key" \
-d '{
"name": "get_cluster_info",
"arguments": {}
}'
# Monitor resources
curl -X POST http://localhost:3000/tools/call \
-H "Content-Type: application/json" \
-H "X-API-Key: emr-mcp-default-key" \
-d '{
"name": "monitor_resources",
"arguments": {}
}'
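The same `POST /tools/call` request can be built with nothing but the Python standard library. This is a minimal sketch mirroring the cURL calls above; the helper name `make_tool_request` is hypothetical, and the endpoint path and `X-API-Key` header are taken from those examples.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def make_tool_request(base_url: str, api_key: str, name: str,
                      arguments: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /tools/call request."""
    body = json.dumps({"name": name, "arguments": arguments or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tools/call",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    request = make_tool_request("http://localhost:3000", "emr-mcp-default-key",
                                "get_cluster_info")
    # Sending requires a running server:
    with urllib.request.urlopen(request, timeout=30) as response:
        print(json.loads(response.read().decode("utf-8")))
```

Separating request construction from sending keeps the helper easy to test without a live cluster.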
Development
Running Tests
# Install development dependencies
pip install -r requirements.txt
# Run all tests
pytest
# Run specific test file
pytest tests/test_cluster.py -v
# Run with coverage
pytest --cov=src tests/ --cov-report=html
# Run demo with mock data
python demo.py
# Test server creation
python test_server.py
Code Quality
# Format code
black src/ tests/ examples/
# Sort imports
isort src/ tests/ examples/
# Type checking
mypy src/
# Linting
flake8 src/ tests/ examples/
Architecture
emr-mcp-server/
├── src/
│   ├── server.py              # Main MCP server implementation
│   ├── tools/                 # MCP tool implementations
│   │   ├── cluster.py         # Cluster management tools
│   │   ├── monitoring.py      # Resource monitoring tools
│   │   └── analytics.py       # Analytics and optimization tools
│   ├── connectors/            # Service connectors
│   │   ├── emr.py             # EMR API connector
│   │   ├── yarn.py            # YARN ResourceManager connector
│   │   ├── spark.py           # Spark History Server connector
│   │   └── hdfs.py            # HDFS NameNode connector
│   └── utils/                 # Utilities
│       ├── config.py          # Configuration management
│       └── auth.py            # Authentication utilities
├── config/
│   └── server_config.yaml     # Server configuration
├── tests/                     # Comprehensive test suite
├── examples/                  # Usage examples
├── scripts/                   # Deployment scripts
├── Dockerfile                 # Docker configuration
├── docker-compose.yml         # Docker Compose setup
├── demo.py                    # Demo with mock data
└── test_server.py             # Server creation test
Key Features Demonstrated
Completed Implementation
Complete Project Structure
- Organized codebase with clear separation of concerns
- Proper Python package structure with imports
- Configuration management with YAML and environment variables
MCP Server Implementation
- Full MCP protocol compliance with tool registration
- Async/await architecture for high performance
- Structured logging with configurable formats
- Graceful shutdown with proper cleanup
Service Connectors
- EMR API integration for cluster management
- YARN ResourceManager connector for application monitoring
- Spark History Server connector for job analysis
- HDFS NameNode connector for storage monitoring
- Connection pooling and retry logic
MCP Tools
- Cluster Management: get_cluster_info, list_clusters, estimate_cost, suggest_instance_types
- Monitoring: monitor_resources, analyze_yarn_applications, diagnose_performance
- Analytics: get_spark_logs, recommend_configuration
- All tools return structured markdown with actionable insights
Security & Authentication
- Multi-method authentication (API keys, JWT, IAM roles)
- Input validation and sanitization
- Secure configuration management
Deployment Ready
- Docker containerization with multi-stage builds
- EMR bootstrap script for automatic deployment
- Systemd service configuration
- Docker Compose for development
Testing & Quality
- Comprehensive test suite with mocking
- Demo script with realistic mock data
- Code quality tools (black, isort, mypy, flake8)
- Type hints throughout codebase
Documentation & Examples
- Detailed README with usage examples
- Python client example with async patterns
- cURL examples for API testing
- Configuration examples and deployment guides
Demo Results
The demo successfully shows:
EMR MCP Server Demo
================================================================================
EMR Cluster Management Demo
Getting Cluster Information...
Cost Estimation...
Instance Type Suggestions...
Resource Monitoring Demo
Resource Monitoring...
YARN Applications Analysis...
Analytics & Configuration Demo
Configuration Recommendations for Batch Workload...
Configuration Recommendations for ML Workload...
Demo completed successfully!
Production-Ready Features
- Error Handling: Comprehensive error handling with meaningful messages
- Logging: Structured logging with multiple output formats
- Configuration: Environment-based configuration with validation
- Monitoring: Health checks and metrics endpoints
- Security: Authentication, authorization, and input validation
- Performance: Async operations, connection pooling, caching
- Deployment: Multiple deployment options with automation
Contributing
We welcome contributions! Please follow this workflow:
1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Run the test suite and quality checks
5. Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- AWS EMR Team for the excellent big data platform
- MCP Community for the protocol specification
- Apache Spark and Hadoop communities
Made with ❤️ for the EMR community
Ready for production deployment on EMR clusters!