Kubernetes + Prometheus SRE MCP Server
Enables natural language Kubernetes cluster operations, SLO monitoring, and PromQL queries via Claude using the Model Context Protocol.
Natural language Kubernetes operations, powered by the Model Context Protocol (MCP)
Built to scale from a single cluster to multi-cluster, multi-team enterprise environments.
What Is This?
An MCP (Model Context Protocol) server that exposes Kubernetes SRE operations as tools an AI assistant can call.
You: "Run the high error rate runbook for the production namespace"
Claude: [calls run_runbook → executes org-approved diagnosis sequence]
Step 1: Checked deployments: nginx (3/3), api-service (1/3 ⚠️)
Step 2: Found pod api-service-7f9d: 47 restarts, OOMKilled
Step 3: Warning events: OOMKilled x3 in last 10 minutes
Recommendation: Increase memory limit to 512Mi + scale to 5 replicas
What's New in v2.0
| Feature | v1 | v2 |
|---|---|---|
| Clusters supported | 1 (hardcoded) | Many (dynamic context switching) |
| Write operations | Unrestricted | Policy-checked with guardrails |
| Audit trail | None | Full structured JSON log |
| Incident diagnosis | Ad-hoc | Encoded runbooks (standardized) |
| Operational consistency | Per-engineer | Org-wide enforced |
Tools
Read
| Tool | Description |
|---|---|
| `list_clusters` | All clusters in kubeconfig |
| `get_pods` | Pod status, restarts, container states |
| `get_crashlooping_pods` | CrashLoopBackOff pods across all namespaces |
| `get_pod_logs` | Logs, including those of the previous crashed container |
| `get_node_health` | Node readiness and pressure conditions |
| `get_deployments` | Desired vs ready vs available replicas |
| `get_events` | Warning events (key incident signal) |
| `get_namespaces` | All namespaces |
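
As a rough illustration of the decision behind a tool like `get_crashlooping_pods`, the sketch below filters plain dictionaries shaped loosely like pod container statuses. The dictionary layout and the `find_crashlooping` name are assumptions made for this example; the server itself reads pod data through the kubernetes Python SDK, and its real implementation may differ.

```python
from typing import Any


def find_crashlooping(pods: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one summary per container whose waiting reason is CrashLoopBackOff."""
    hits = []
    for pod in pods:
        for cs in pod.get("container_statuses", []):
            # A crash-looping container sits in the "waiting" state
            # with reason "CrashLoopBackOff".
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append({
                    "namespace": pod["namespace"],
                    "pod": pod["name"],
                    "container": cs["name"],
                    "restarts": cs.get("restart_count", 0),
                })
    return hits


# Hypothetical sample data, not real cluster output.
sample = [
    {"namespace": "production", "name": "api-service-7f9d",
     "container_statuses": [{"name": "api", "restart_count": 47,
                             "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]},
    {"namespace": "production", "name": "nginx-1",
     "container_statuses": [{"name": "nginx", "restart_count": 0,
                             "state": {"running": {}}}]},
]
crashlooping = find_crashlooping(sample)
print(crashlooping)
```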
Write (Policy-checked + Audit-logged)
| Tool | Policy Enforced |
|---|---|
| `scale_deployment` | Max replicas · Blocked namespaces · Prod minimums |
SRE Runbooks
| Tool | Description |
|---|---|
| `list_runbooks` | Available runbooks with triggers |
| `run_runbook` | Execute org-standard diagnosis sequence |
Governance
| Tool | Description |
|---|---|
| `get_audit_log` | All recent operations with timestamps |
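
To make "structured JSON log" concrete, here is a minimal sketch of what one audit record could look like, written as one JSON object per line. The field names (`timestamp`, `cluster`, `tool`, `arguments`, `outcome`) and the `audit_entry` helper are assumptions for illustration, not the schema used by `audit.py`.

```python
import json
from datetime import datetime, timezone


def audit_entry(tool: str, arguments: dict, outcome: str,
                cluster: str = "default") -> str:
    """Serialize a single operation as one JSON line (hypothetical schema)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cluster": cluster,
        "tool": tool,
        "arguments": arguments,
        "outcome": outcome,
    }
    # sort_keys keeps the field order stable, which makes logs easier to diff.
    return json.dumps(record, sort_keys=True)


line = audit_entry(
    "scale_deployment",
    {"namespace": "production", "name": "nginx", "replicas": 0},
    "denied",
)
print(line)
```

Recording denied operations as well as successful ones matches the behaviour described for policy checks, where a rejected scale request is still audit-logged.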
Architecture
Claude Desktop (MCP Host)
        │
        │  MCP Protocol (stdio / JSON-RPC)
        ▼
┌─────────────────────────────────────┐
│  SRE MCP Server v2                  │
│  server.py       → entry point      │
│  cluster_manager → multi-cluster    │
│  policy.py       → write guards     │
│  audit.py        → JSON audit log   │
│  runbooks.py     → SRE runbooks     │
└──────────────┬──────────────────────┘
               │  kubernetes Python SDK
               ▼
┌──────────────────────┐
│  Kubernetes Clusters │
│  (any kubeconfig     │
│   context)           │
└──────────────────────┘
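
The MCP link in the diagram carries JSON-RPC 2.0 messages, and a tool invocation arrives as a `tools/call` request naming the tool and its arguments. The sketch below shows that dispatch step using only the standard library. The `handle` function, the `TOOLS` registry, and the stubbed `get_namespaces` are assumptions for illustration; the real `server.py` may well rely on an MCP SDK instead of hand-written dispatch.

```python
import json
from typing import Any, Callable


def get_namespaces() -> list[str]:
    # Stub (assumption): a real tool would query the cluster here.
    return ["default", "kube-system", "production"]


# Hypothetical registry mapping tool names to handlers.
TOOLS: dict[str, Callable[..., Any]] = {"get_namespaces": get_namespaces}


def handle(raw: str) -> str:
    """Dispatch one JSON-RPC request string and return the response string."""
    req = json.loads(raw)
    base = {"jsonrpc": "2.0", "id": req.get("id")}
    if req.get("method") != "tools/call":
        # -32601 is the JSON-RPC 2.0 "Method not found" code.
        return json.dumps({**base, "error": {"code": -32601,
                                             "message": "Method not found"}})
    params = req.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        # -32602 is the JSON-RPC 2.0 "Invalid params" code.
        return json.dumps({**base, "error": {"code": -32602,
                                             "message": "Unknown tool"}})
    result = tool(**params.get("arguments", {}))
    # Tool results are returned as a list of content items.
    return json.dumps({**base, "result": {
        "content": [{"type": "text", "text": json.dumps(result)}],
        "isError": False,
    }})


request = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
           "params": {"name": "get_namespaces", "arguments": {}}}
response = json.loads(handle(json.dumps(request)))
print(response)
```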
Quick Start
git clone https://github.com/ManishMaurya22/sre-mcp-server
cd sre-mcp-server
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "sre-k8s": {
      "command": "/Users/<YOUR_USERNAME>/sre-mcp-server/venv/bin/python",
      "args": ["/Users/<YOUR_USERNAME>/sre-mcp-server/server.py"]
    }
  }
}
See docs/SETUP.md for full setup guide.
Policy Configuration
export POLICY_MAX_REPLICAS=30
export POLICY_SCALE_BLOCKED_NS="kube-system,gatekeeper-system"
export POLICY_PROD_NAMESPACES="production,prod"
export POLICY_PROD_MIN_REPLICAS=2
You: "Scale nginx to 0 in production"
Claude: Policy Denied - scaling to 0 not allowed in production (min: 2)
Operation audit-logged.
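
The environment variables above suggest a check along the following lines. This is a minimal sketch, not the contents of `policy.py`: the `check_scale` function, the order of the checks, and the fallback defaults used when a variable is unset are all assumptions.

```python
import os
from typing import Mapping, Optional, Tuple


def _csv(value: str) -> set:
    """Split a comma-separated setting into a set of trimmed names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def check_scale(namespace: str, replicas: int,
                env: Optional[Mapping[str, str]] = None) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed scale operation."""
    env = os.environ if env is None else env
    # Fallback defaults below are assumptions, not documented behaviour.
    max_replicas = int(env.get("POLICY_MAX_REPLICAS", "30"))
    blocked = _csv(env.get("POLICY_SCALE_BLOCKED_NS", ""))
    prod = _csv(env.get("POLICY_PROD_NAMESPACES", ""))
    prod_min = int(env.get("POLICY_PROD_MIN_REPLICAS", "0"))

    if replicas < 0:
        return False, "replica count cannot be negative"
    if namespace in blocked:
        return False, f"scaling is blocked in namespace {namespace}"
    if replicas > max_replicas:
        return False, f"{replicas} exceeds max replicas ({max_replicas})"
    if namespace in prod and replicas < prod_min:
        return False, f"below production minimum (min: {prod_min})"
    return True, "allowed"


# Same values as the export lines above, passed explicitly for the example.
policy_env = {
    "POLICY_MAX_REPLICAS": "30",
    "POLICY_SCALE_BLOCKED_NS": "kube-system,gatekeeper-system",
    "POLICY_PROD_NAMESPACES": "production,prod",
    "POLICY_PROD_MIN_REPLICAS": "2",
}
print(check_scale("production", 0, policy_env))
print(check_scale("production", 5, policy_env))
```

Passing the settings as a mapping rather than reading `os.environ` directly keeps the check easy to exercise without touching the real environment.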
Encoded Runbooks
Available: high_error_rate · node_pressure · deployment_rollback
You: "Run the high_error_rate runbook for production"
Claude runs in order:
1. get_deployments → spot unhealthy deployments
2. get_pods → check restart counts
3. get_events → surface warning signals
4. get_crashlooping_pods → cluster-wide check
+ surfaces remediation hints
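
One way to encode the sequence above is as an ordered list of tool names that a runner executes step by step. The sketch below assumes that shape: the `RUNBOOKS` table, the `run_runbook` signature, and the stub tools are hypothetical, and only the `high_error_rate` step order comes from this README. For simplicity every stub takes a namespace, even though `get_crashlooping_pods` is described as cluster-wide.

```python
from typing import Any, Callable

# Step order for high_error_rate as listed above; other runbooks are omitted
# because their steps are not documented here.
RUNBOOKS: dict[str, list[str]] = {
    "high_error_rate": [
        "get_deployments",
        "get_pods",
        "get_events",
        "get_crashlooping_pods",
    ],
}


def run_runbook(name: str, namespace: str,
                tools: dict[str, Callable[[str], Any]]) -> list[dict[str, Any]]:
    """Run each tool of the named runbook in order and collect the outputs."""
    results = []
    for step, tool_name in enumerate(RUNBOOKS[name], start=1):
        output = tools[tool_name](namespace)
        results.append({"step": step, "tool": tool_name, "output": output})
    return results


def make_stub(tool_name: str) -> Callable[[str], str]:
    # Stub (assumption): stands in for a real cluster query.
    return lambda namespace: f"{tool_name} ran for {namespace}"


stub_tools = {tool: make_stub(tool) for tool in RUNBOOKS["high_error_rate"]}
report = run_runbook("high_error_rate", "production", stub_tools)
for entry in report:
    print(entry["step"], entry["tool"], entry["output"])
```

Keeping the sequence as data rather than code means every engineer runs the same steps in the same order, which is the consistency goal the runbooks are meant to serve.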
Structure
sre-mcp-server/
├── server.py                  # Main MCP server
├── cluster_manager.py         # Multi-cluster context management
├── policy.py                  # Write operation guardrails
├── audit.py                   # Structured audit trail
├── runbooks.py                # Encoded SRE runbooks
├── requirements.txt
├── tools/k8s_tools.py
├── config/claude_desktop_config.example.json
├── docs/
│   ├── SETUP.md
│   └── INTERVIEW_GUIDE.md
└── .github/workflows/ci.yaml
Roadmap
- [ ] Prometheus MCP: SLO burn rate queries
- [ ] PagerDuty MCP: incident acknowledgement
- [ ] ArgoCD MCP: GitOps sync and triggers
- [ ] Central MCP Gateway: auth + multi-team routing
License
MIT. See LICENSE
Built by Manish Maurya, DevOps/SRE Leader | 16+ Years | Abu Dhabi, UAE
Website: https://manishmaurya22.github.io/