Multi-Cloud Infrastructure MCP Server
Enables deployment and management of GPU workloads across multiple cloud providers (RunPod, Vast.ai) with intelligent GPU selection, resource monitoring, and telemetry tracking through Redis, ClickHouse, and SkyPilot integration.
MCP (Multi-Cloud Platform) Server
This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks to report resource lifecycle and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
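To make the Envoy hook concrete, here is a minimal sketch of posting one lifecycle event to such an ingestion endpoint; the ENVOY_INGEST_URL variable, event name, and payload fields are illustrative assumptions, not a fixed contract:
# envoy_event_sketch.py: illustrative only; the endpoint URL, event name,
# and payload fields are assumptions, not the repo's actual contract.
import os
import time

import requests

ENVOY_INGEST_URL = os.getenv("ENVOY_INGEST_URL", "http://localhost:9000/ingest")

event = {
    "event_type": "resource.provisioned",  # hypothetical event name
    "provider": "runpod",
    "resource_id": "pod_abc123",
    "timestamp": time.time(),
}

requests.post(ENVOY_INGEST_URL, json=event, timeout=10).raise_for_status()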
Highlights
- Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
- SkyPilot Agent builds dynamic YAML and executes the sky CLI.
- OnPrem Agent runs on-prem deployments (placeholder using kubectl/helm).
- Orchestrator wires agents together using Redis queues and ClickHouse telemetry (see the sketch after this list).
- Pluggable LLM client, configured by default to call a local LiteLLM gateway for minimax-m1.
- Phoenix observability hooks and Envoy integration for telemetry events.
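A minimal sketch of that wiring, assuming JSON task payloads on a Redis list named mcp:tasks and plans handed off on mcp:plans (both queue names are assumptions, not the repository's actual contract):
# orchestrator_sketch.py: minimal illustration of the Redis-queue wiring;
# queue names and payload schema are assumptions, not the repo's contract.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_once() -> None:
    # Block until a task arrives on the (assumed) task queue.
    _, raw = r.blpop("mcp:tasks")
    task = json.loads(raw)

    # An Evaluation Agent would turn the task into a resource plan here;
    # a trivial plan stands in to show the data flow.
    plan = {"task_id": task.get("id"), "gpu": "RTX 3060", "count": 1}

    # Hand the plan to the next agent (e.g. the SkyPilot Agent).
    r.rpush("mcp:plans", json.dumps(plan))

if __name__ == "__main__":
    run_once()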
Additional files
scripts/resource_heartbeat.py: example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat to the orchestrator.
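For orientation, a poster of that shape might look like the sketch below. It is not the repository's script: it shells out to nvidia-smi, and the ORCHESTRATOR_URL, RESOURCE_ID, and payload fields are assumptions.
# heartbeat_sketch.py: illustrative stand-in for scripts/resource_heartbeat.py;
# the orchestrator URL and payload fields are assumptions.
import os
import subprocess
import time

import requests

ORCHESTRATOR_URL = os.getenv("ORCHESTRATOR_URL", "http://localhost:8000/api/v1/heartbeat")
RESOURCE_ID = os.getenv("RESOURCE_ID", "pod_abc123")
INTERVAL = int(os.getenv("HEARTBEAT_INTERVAL", "60"))

def gpu_utilization() -> int:
    # Read the first GPU's utilization percentage via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0].strip())

while True:
    payload = {"resource_id": RESOURCE_ID, "gpu_util": gpu_utilization(), "ts": time.time()}
    try:
        requests.post(ORCHESTRATOR_URL, json=payload, timeout=10)
    except requests.RequestException:
        pass  # keep beating even if the orchestrator is briefly unreachable
    time.sleep(INTERVAL)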
Quick start (local dry-run)
- Install Python packages: pip install -r requirements.txt
- Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.
- Start the MCP server: python -m src.mcp.main
- Push a demo task into Redis (see scripts/run_demo.sh, or the Python sketch after this list).
- Verify telemetry is forwarded to the Phoenix and Envoy endpoints (configurable in .env).
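Without the shell script, a demo task can also be pushed with redis-py; the queue name and payload fields below are assumptions (check scripts/run_demo.sh for the actual format):
# push_demo_task.py: queue name and payload fields are assumptions;
# scripts/run_demo.sh shows the repository's actual format.
import json

import redis

r = redis.Redis(host="localhost", port=6379)
r.rpush("mcp:tasks", json.dumps({"id": "demo-1", "task_type": "inference"}))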
Notes & caveats
- This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.
MCP Orchestrator - Quick Reference
🚀 Installation (5 minutes)
# 1. Configure environment
cp .env.example .env
nano .env # Add your API keys
# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh
# 3. Verify
curl http://localhost:8000/health
📡 Common API Calls
Deploy with Auto GPU Selection
# Inference workload (will select a cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "inference",
"spec": {
"name": "llm-server",
"image": "vllm/vllm-openai:latest",
"command": "python -m vllm.entrypoints.api_server"
}
}'
# Training workload (will select a powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "training",
"spec": {
"name": "fine-tune-job",
"image": "pytorch/pytorch:latest"
}
}'
Deploy with Specific GPU
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"spec": {
"name": "custom-pod",
"gpu_name": "RTX 4090",
"resources": {
"accelerators": "RTX 4090:2"
}
}
}'
Deploy to Provider (Default: ON_DEMAND + RTX 3060)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{"spec": {"name": "simple-pod"}}'
Register Existing Infrastructure
# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_id": "12345",
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_ids": ["12345", "67890"],
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
List Resources
# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list
# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list
Terminate Resource
curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123
🎯 GPU Rules Management
View Rules
curl http://localhost:8000/api/v1/gpu-rules
Add Rule
curl -X POST http://localhost:8000/api/v1/gpu-rules \
-H "Content-Type: application/json" \
-d '{
"gpu_family": "H100",
"type": "Enterprise",
"min_use_case": "large-scale training",
"optimal_use_case": "foundation models",
"power_rating": "700W",
"typical_cloud_instance": "RunPod",
"priority": 0
}'
Delete Rule
curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060
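To make the rule schema concrete, a selector over these records could order candidates by priority and match the task type against the use-case fields; this is a hypothetical sketch of the idea, not the server's actual selection logic:
# gpu_rule_selector_sketch.py: hypothetical illustration of priority-ordered
# rule matching; not the server's actual algorithm.
RULES = [
    {"gpu_family": "H100", "optimal_use_case": "foundation models", "priority": 0},
    {"gpu_family": "RTX 4090", "optimal_use_case": "training", "priority": 1},
    {"gpu_family": "RTX 3060", "optimal_use_case": "inference", "priority": 2},
]

def select_gpu(task_type: str) -> str:
    # Lowest priority number wins among rules whose use case mentions the task.
    matches = [r for r in RULES if task_type in r["optimal_use_case"]]
    return min(matches or RULES, key=lambda r: r["priority"])["gpu_family"]

print(select_gpu("inference"))  # RTX 3060
print(select_gpu("training"))   # RTX 4090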
🔍 Monitoring
ClickHouse Queries
-- Active resources
SELECT provider, status, count() as total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;
-- Recent deployments
SELECT *
FROM deployments
ORDER BY created_at DESC
LIMIT 10;
-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;
-- Cost analysis
SELECT
provider,
sum(price_hour) as total_hourly_cost,
avg(price_hour) as avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;
-- Event volume
SELECT
event_type,
count() as count,
toStartOfHour(timestamp) as hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;
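The same queries can be run from Python; a sketch using the clickhouse-connect client (pip install clickhouse-connect), assuming the host, database name, and credentials configured in .env:
# clickhouse_query_example.py: assumes clickhouse-connect and the
# credentials configured in .env; the "mcp" database name is taken from
# the troubleshooting section below.
import os

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password=os.getenv("CLICKHOUSE_PASSWORD", ""),
    database="mcp",
)

result = client.query(
    "SELECT provider, status, count() AS total "
    "FROM resources WHERE status IN ('running', 'active') "
    "GROUP BY provider, status"
)
for provider, status, total in result.result_rows:
    print(provider, status, total)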
View Logs
# All services
docker-compose logs -f
# API only
docker-compose logs -f mcp-api
# Heartbeat monitor
docker-compose logs -f heartbeat-worker
# ClickHouse
docker-compose logs -f clickhouse
🛠️ Maintenance
Restart Services
# Restart all
docker-compose restart
# Restart API only
docker-compose restart mcp-api
# Reload with new code
docker-compose up -d --build
Backup ClickHouse
# Backup database
docker-compose exec clickhouse clickhouse-client --query \
"BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"
# Export table (-T disables the pseudo-TTY so the redirected output isn't mangled)
docker-compose exec -T clickhouse clickhouse-client --query \
"SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv
Clean Up
# Stop all services
docker-compose down
# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v
# Prune old data from ClickHouse (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
"OPTIMIZE TABLE events FINAL"
🐛 Troubleshooting
Service won't start
# Check status
docker-compose ps
# Check logs
docker-compose logs mcp-api
# Verify config
cat .env | grep -v '^#' | grep -v '^$'
ClickHouse connection issues
# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
# Reinitialize (-T is needed when piping a file into the container)
docker-compose exec -T clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql
# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"
API returns 404 for provider
# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"
# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api
Heartbeat not working
# Check heartbeat worker
docker-compose logs heartbeat-worker
# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list
📝 Environment Variables
Key variables in .env:
# Required
RUNPOD_API_KEY=xxx # Your RunPod API key
VASTAI_API_KEY=xxx # Your Vast.ai API key (used per-request only)
# ClickHouse
CLICKHOUSE_PASSWORD=xxx # Set strong password
# Optional
LOG_LEVEL=INFO # DEBUG for verbose logs
WORKERS=4 # API worker processes
HEARTBEAT_INTERVAL=60 # Seconds between health checks
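A minimal sketch of reading these at startup, assuming python-dotenv is installed (pip install python-dotenv):
# settings_sketch.py: assumes python-dotenv; variable names match the
# .env keys listed above.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]   # required
VASTAI_API_KEY = os.environ["VASTAI_API_KEY"]   # required
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
WORKERS = int(os.getenv("WORKERS", "4"))
HEARTBEAT_INTERVAL = int(os.getenv("HEARTBEAT_INTERVAL", "60"))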
🔐 Security Checklist
- [ ] Change default ClickHouse password
- [ ] Store .env securely (add it to .gitignore)
- [ ] Use separate API keys for prod/staging
- [ ] Enable ClickHouse authentication
- [ ] Configure AI Envoy Gateway policies
- [ ] Rotate API keys regularly
- [ ] Review ClickHouse access logs
- [ ] Set up alerting for unhealthy resources
📚 Resources
- API Docs: http://localhost:8000/docs
- ClickHouse UI: http://localhost:8124 (with --profile debug)
- Health Check: http://localhost:8000/health
- Full README: See README.md