Kubernetes + Prometheus SRE MCP Server

Enables natural language Kubernetes cluster operations, SLO monitoring, and PromQL queries via Claude using the Model Context Protocol.


🤖 Kubernetes + Prometheus SRE MCP Server: natural language cluster ops, SLO monitoring, and PromQL queries via Claude

Natural language Kubernetes operations, powered by the Model Context Protocol (MCP).
Built to scale from a single cluster to multi-cluster, multi-team enterprise environments.



🎯 What Is This?

An MCP (Model Context Protocol) server that exposes Kubernetes SRE operations as tools an AI assistant can call.

You:    "Run the high error rate runbook for the production namespace"

Claude: [calls run_runbook → executes org-approved diagnosis sequence]
        Step 1: Checked deployments: nginx (3/3), api-service (1/3 ⚠️)
        Step 2: Found pod api-service-7f9d: 47 restarts, OOMKilled
        Step 3: Warning events: OOMKilled x3 in last 10 minutes
        Recommendation: Increase memory limit to 512Mi + scale to 5 replicas
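
For orientation, here is roughly what the `run_runbook` call in the exchange above could look like on the wire. MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`, and the stdio transport sends one JSON message per line. This is a minimal sketch: the argument names (`name`, `namespace`) are assumptions, since the real input schema is whatever server.py declares for the tool.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as one JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # Compact output with no indent keeps the message on a single line,
    # which is what a newline-delimited stdio transport expects.
    return json.dumps(message, separators=(",", ":"))


# Hypothetical argument names; check the tool's declared schema before relying on them.
line = build_tool_call(
    1, "run_runbook", {"name": "high_error_rate", "namespace": "production"}
)
print(line)
```

The host (Claude Desktop) builds and sends this message itself; the sketch only shows the shape of the request the server receives.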

✨ What's New in v2.0

| Feature | v1 | v2 |
| --- | --- | --- |
| Clusters supported | 1 (hardcoded) | Many (dynamic context switching) |
| Write operations | Unrestricted | Policy-checked with guardrails |
| Audit trail | None | Full structured JSON log |
| Incident diagnosis | Ad-hoc | Encoded runbooks (standardized) |
| Operational consistency | Per-engineer | Org-wide enforced |

🛠️ Tools

Read

| Tool | Description |
| --- | --- |
| list_clusters | All clusters in kubeconfig |
| get_pods | Pod status, restarts, container states |
| get_crashlooping_pods | CrashLoopBackOff pods across all namespaces |
| get_pod_logs | Logs, including those of the previous crashed container |
| get_node_health | Node readiness and pressure conditions |
| get_deployments | Desired vs ready vs available replicas |
| get_events | Warning events (a key incident signal) |
| get_namespaces | All namespaces |
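
As an illustration of how a read tool such as get_crashlooping_pods might work, the sketch below filters pods whose containers are waiting with reason `CrashLoopBackOff`. It is not the project's implementation: the function names are hypothetical, and the filter is kept separate from the cluster call so it can be exercised with stand-in objects that mimic the attribute layout of the kubernetes Python client (`status.container_statuses[].state.waiting.reason`).

```python
from types import SimpleNamespace


def crashlooping(pods):
    """Return (namespace, pod, container, restarts) for CrashLoopBackOff containers."""
    found = []
    for pod in pods:
        # container_statuses is None for pods that have not started any container
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting is not None and waiting.reason == "CrashLoopBackOff":
                found.append(
                    (pod.metadata.namespace, pod.metadata.name, cs.name, cs.restart_count)
                )
    return found


def fetch_all_pods():
    """Live-cluster path; needs the `kubernetes` package and a valid kubeconfig."""
    from kubernetes import client, config

    config.load_kube_config()
    return client.CoreV1Api().list_pod_for_all_namespaces().items


def _fake_pod(namespace, name, reason, restarts):
    """Stand-in object with the same attribute shape as a V1Pod (demo only)."""
    waiting = SimpleNamespace(reason=reason) if reason else None
    status = SimpleNamespace(
        name="app", restart_count=restarts, state=SimpleNamespace(waiting=waiting)
    )
    return SimpleNamespace(
        metadata=SimpleNamespace(namespace=namespace, name=name),
        status=SimpleNamespace(container_statuses=[status]),
    )


sample = [
    _fake_pod("production", "api-service-7f9d", "CrashLoopBackOff", 47),
    _fake_pod("production", "nginx-1", None, 0),
]
print(crashlooping(sample))
```

Against a real cluster, `crashlooping(fetch_all_pods())` would apply the same filter to live data.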

Write (Policy-checked + Audit-logged)

| Tool | Policy Enforced |
| --- | --- |
| scale_deployment | Max replicas · Blocked namespaces · Prod minimums |

SRE Runbooks

| Tool | Description |
| --- | --- |
| list_runbooks | Available runbooks with triggers |
| run_runbook | Execute org-standard diagnosis sequence |

Governance

| Tool | Description |
| --- | --- |
| get_audit_log | All recent operations with timestamps |
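
A structured JSON audit trail is commonly stored as one JSON object per line, which makes "recent operations" a matter of reading the tail of the file. The sketch below shows that idea. The field names, the file name, and the helper functions are assumptions for illustration, not the format audit.py actually writes.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def audit_record(tool: str, params: dict, allowed: bool, reason: str = "") -> str:
    """Build one audit entry as a JSON line (field names are hypothetical)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "params": params,
        "allowed": allowed,
        "reason": reason,
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path: str, line: str) -> None:
    """Append a single entry; append mode keeps earlier entries intact."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def read_recent(path: str, limit: int = 20) -> list:
    """Return the last `limit` entries, oldest first."""
    with open(path, encoding="utf-8") as fh:
        lines = [ln for ln in fh if ln.strip()]
    return [json.loads(ln) for ln in lines[-limit:]]


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "audit.jsonl")
    append_audit(log_path, audit_record("get_pods", {"namespace": "production"}, True))
    append_audit(
        log_path,
        audit_record(
            "scale_deployment",
            {"namespace": "production", "replicas": 0},
            False,
            "below production minimum",
        ),
    )
    recent = read_recent(log_path, limit=1)
    print(recent[0]["tool"], recent[0]["allowed"])
```

Denied operations are logged as well as allowed ones, which matches the "Operation audit-logged" behavior described for policy denials.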

🏗️ Architecture

Claude Desktop (MCP Host)
       │
       │  MCP Protocol (stdio / JSON-RPC)
       ▼
┌─────────────────────────────────────┐
│         SRE MCP Server v2           │
│  server.py        ← entry point     │
│  cluster_manager  ← multi-cluster   │
│  policy.py        ← write guards    │
│  audit.py         ← JSON audit log  │
│  runbooks.py      ← SRE runbooks    │
└──────────────┬──────────────────────┘
               │  kubernetes Python SDK
               ▼
    ┌──────────────────────┐
    │  Kubernetes Clusters │
    │  (any kubeconfig     │
    │   context)           │
    └──────────────────────┘
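
The multi-cluster layer in the diagram can be pictured as a cache of API clients keyed by kubeconfig context. The following is a minimal sketch under that assumption. `ClusterManager` and `api_for` are hypothetical names, not the contents of cluster_manager.py. The default factory uses `config.new_client_from_config(context=...)` from the kubernetes Python client, and the factory is injectable so the caching logic can be tried without a cluster.

```python
from typing import Any, Callable, Dict


def _default_factory(context: str) -> Any:
    """Build a CoreV1Api bound to one kubeconfig context (needs `kubernetes`)."""
    from kubernetes import client, config

    api_client = config.new_client_from_config(context=context)
    return client.CoreV1Api(api_client)


class ClusterManager:
    """Create one API client per context on first use, then reuse it."""

    def __init__(self, factory: Callable[[str], Any] = _default_factory):
        self._factory = factory
        self._clients: Dict[str, Any] = {}

    def api_for(self, context: str) -> Any:
        if context not in self._clients:
            self._clients[context] = self._factory(context)
        return self._clients[context]


# Demo with a stub factory so no cluster or kubeconfig is required.
created = []


def stub_factory(context: str) -> str:
    created.append(context)
    return f"client-for-{context}"


manager = ClusterManager(factory=stub_factory)
manager.api_for("prod-eu")
manager.api_for("prod-eu")   # served from the cache, factory not called again
manager.api_for("staging")
print(created)
```

Binding each client to an explicit context avoids mutating the process-wide default configuration, so requests for different clusters cannot interfere with each other.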

🚀 Quick Start

git clone https://github.com/ManishMaurya22/sre-mcp-server
cd sre-mcp-server
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

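Edit the Claude Desktop configuration file (on macOS, ~/Library/Application Support/Claude/claude_desktop_config.json) and add the server entry, replacing <YOUR_USERNAME> with your own account name: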

{
  "mcpServers": {
    "sre-k8s": {
      "command": "/Users/<YOUR_USERNAME>/sre-mcp-server/venv/bin/python",
      "args": ["/Users/<YOUR_USERNAME>/sre-mcp-server/server.py"]
    }
  }
}
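
If the config file already lists other MCP servers, the entry has to be merged rather than pasted over the whole file. The helper below is a hypothetical sketch of that merge (`register_server` is not part of the repository). It assumes the repository layout from the clone step, with the virtualenv at venv/ and the entry point at server.py, and the demo runs against a temporary file instead of the real config.

```python
import json
import tempfile
from pathlib import Path


def register_server(config_path: Path, repo_dir: Path, name: str = "sre-k8s") -> dict:
    """Add or update one entry under mcpServers, leaving other entries untouched."""
    text = config_path.read_text(encoding="utf-8") if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "command": str(repo_dir / "venv" / "bin" / "python"),
        "args": [str(repo_dir / "server.py")],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo against a temporary file that already contains another server entry.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "claude_desktop_config.json"
    cfg.write_text('{"mcpServers": {"other": {"command": "x"}}}', encoding="utf-8")
    merged = register_server(cfg, Path("/Users/<YOUR_USERNAME>/sre-mcp-server"))
    on_disk = json.loads(cfg.read_text(encoding="utf-8"))
    print(sorted(merged["mcpServers"]))
```

To apply it for real, point `config_path` at the Claude Desktop config file and `repo_dir` at the cloned repository.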

See docs/SETUP.md for the full setup guide.


🔐 Policy Configuration

export POLICY_MAX_REPLICAS=30
export POLICY_SCALE_BLOCKED_NS="kube-system,gatekeeper-system"
export POLICY_PROD_NAMESPACES="production,prod"
export POLICY_PROD_MIN_REPLICAS=2

A write request that violates these limits is denied, and the denial is still recorded in the audit log:

You:    "Scale nginx to 0 in production"
Claude: ❌ Policy Denied: scaling to 0 not allowed in production (min: 2)
        Operation audit-logged.
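
The three guardrails named for scale_deployment (max replicas, blocked namespaces, production minimums) can be sketched as a small check driven by the environment variables above. This is an illustration, not the contents of policy.py: the class name, the fallback defaults, the order of the checks, and the wording of the denial reasons are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ScalePolicy:
    max_replicas: int
    blocked_namespaces: frozenset
    prod_namespaces: frozenset
    prod_min_replicas: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ScalePolicy":
        def names(key: str, default: str) -> frozenset:
            raw = env.get(key, default)
            return frozenset(n.strip() for n in raw.split(",") if n.strip())

        # Fallback defaults are assumptions, used only when a variable is unset.
        return cls(
            max_replicas=int(env.get("POLICY_MAX_REPLICAS", "30")),
            blocked_namespaces=names("POLICY_SCALE_BLOCKED_NS", "kube-system"),
            prod_namespaces=names("POLICY_PROD_NAMESPACES", "production,prod"),
            prod_min_replicas=int(env.get("POLICY_PROD_MIN_REPLICAS", "2")),
        )

    def check_scale(self, namespace: str, replicas: int) -> Tuple[bool, str]:
        """Return (allowed, reason) for a requested replica count."""
        if replicas < 0:
            return False, "replica count cannot be negative"
        if namespace in self.blocked_namespaces:
            return False, f"scaling is blocked in namespace {namespace}"
        if replicas > self.max_replicas:
            return False, f"{replicas} exceeds max replicas ({self.max_replicas})"
        if namespace in self.prod_namespaces and replicas < self.prod_min_replicas:
            return False, f"{replicas} is below the production minimum ({self.prod_min_replicas})"
        return True, "ok"


policy = ScalePolicy.from_env({
    "POLICY_MAX_REPLICAS": "30",
    "POLICY_SCALE_BLOCKED_NS": "kube-system,gatekeeper-system",
    "POLICY_PROD_NAMESPACES": "production,prod",
    "POLICY_PROD_MIN_REPLICAS": "2",
})
print(policy.check_scale("production", 0))
print(policy.check_scale("production", 5))
```

Returning a reason alongside the decision lets the caller show the user why a request was denied and write the same text into the audit log.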

📋 Encoded Runbooks

Available: high_error_rate · node_pressure · deployment_rollback

You: "Run the high_error_rate runbook for production"

Claude runs in order:
  1. get_deployments       → spot unhealthy deployments
  2. get_pods              → check restart counts
  3. get_events            → surface warning signals
  4. get_crashlooping_pods → cluster-wide check
  + surfaces remediation hints
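
An encoded runbook of this kind can be represented as an ordered list of tool names that a runner walks through, which is what keeps the diagnosis sequence identical across engineers. The sketch below encodes the four steps listed above. The data layout and the `run_runbook` signature are assumptions rather than the contents of runbooks.py, and the tools are stubs standing in for the real read tools.

```python
from typing import Any, Callable, Dict, List, Tuple

# Each step: (tool name, whether the tool takes a namespace argument)
RUNBOOKS: Dict[str, List[Tuple[str, bool]]] = {
    "high_error_rate": [
        ("get_deployments", True),
        ("get_pods", True),
        ("get_events", True),
        ("get_crashlooping_pods", False),  # cluster-wide, so no namespace
    ],
}


def run_runbook(
    name: str, namespace: str, tools: Dict[str, Callable[..., Any]]
) -> List[dict]:
    """Run every step of a runbook in its declared order and collect the outputs."""
    results = []
    for step, (tool_name, namespaced) in enumerate(RUNBOOKS[name], start=1):
        tool = tools[tool_name]
        output = tool(namespace) if namespaced else tool()
        results.append({"step": step, "tool": tool_name, "output": output})
    return results


# Stub tools for the demo; a real server would call the Kubernetes API here.
stub_tools = {
    "get_deployments": lambda ns: f"deployments in {ns}",
    "get_pods": lambda ns: f"pods in {ns}",
    "get_events": lambda ns: f"warning events in {ns}",
    "get_crashlooping_pods": lambda: "crashlooping pods in all namespaces",
}

report = run_runbook("high_error_rate", "production", stub_tools)
for entry in report:
    print(entry["step"], entry["tool"], "->", entry["output"])
```

Keeping the sequence in data rather than in code means a new runbook is a new dictionary entry, not a new function.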

🗂️ Structure

sre-mcp-server/
├── server.py              # Main MCP server
├── cluster_manager.py     # Multi-cluster context management
├── policy.py              # Write operation guardrails
├── audit.py               # Structured audit trail
├── runbooks.py            # Encoded SRE runbooks
├── requirements.txt
├── tools/k8s_tools.py
├── config/claude_desktop_config.example.json
├── docs/
│   ├── SETUP.md
│   └── INTERVIEW_GUIDE.md
└── .github/workflows/ci.yaml

🗺️ Roadmap

  • [ ] Prometheus MCP: SLO burn rate queries
  • [ ] PagerDuty MCP: incident acknowledgement
  • [ ] ArgoCD MCP: GitOps sync and triggers
  • [ ] Central MCP Gateway: auth + multi-team routing

📄 License

MIT. See LICENSE.


Built by Manish Maurya, DevOps/SRE Leader | 16+ Years | Abu Dhabi, UAE
Website: https://manishmaurya22.github.io/
