Kubeflow MCP Server

Enables AI agents to plan, submit, monitor, and manage Kubeflow training jobs through natural language, without needing to learn Kubernetes or the Kubeflow SDK.

Proposal: KEP-936 · ROADMAP · SECURITY · CONTRIBUTING

Overview

The Kubeflow MCP Server exposes Kubeflow Training operations as Model Context Protocol (MCP) tools, so that AI agents (Claude, Cursor, Claude Code, or any custom agent) can plan, submit, monitor, and manage training jobs through natural language, without users needing to learn Kubernetes or the Kubeflow SDK directly.

Benefits

  • Agent-Native: Tools auto-discovered via MCP — no manual API wiring
  • Guided Workflow: Phase ordering with next-step hints (Plan → Discover → Train → Monitor)
  • Preview-Before-Submit: Every mutating operation requires explicit confirmation
  • Security-First: Persona gating, namespace enforcement, input validation, bearer/JWT auth
  • Multi-Platform: Auto-detects OpenShift, EKS, GKE with platform-specific guidance
  • Token-Efficient: Progressive/semantic modes compress 23 tools into 2-3 meta-tools
  • Extensible: Plugin architecture for additional Kubeflow clients (TODO: optimizer, hub)

Get Started

Install from source

git clone https://github.com/kubeflow/mcp-server.git
cd mcp-server
pip install .

Run the server

kubeflow-mcp serve

Once published to PyPI, install with pip install kubeflow-mcp.

Example: Fine-tune a model via AI agent

Once connected, your AI agent can run a complete training workflow through natural language:

User: "Fine-tune gemma-2b on the alpaca dataset"

Agent calls: check_compatibility()        → ✅ K8s 1.29, Trainer CRD installed
Agent calls: get_cluster_resources()      → 4x A100 GPUs available
Agent calls: estimate_resources("google/gemma-2b") → needs ~16GB GPU, 1x A100
Agent calls: list_runtimes()              → torchtune-llama, torchtune-gemma, ...
Agent calls: fine_tune(                   → preview config (confirmed=False)
    model="hf://google/gemma-2b",
    dataset="hf://tatsu-lab/alpaca",
    runtime="torchtune-gemma-2b"
)
Agent calls: fine_tune(..., confirmed=True) → TrainJob "train-gemma-abc" created
Agent calls: get_training_logs("train-gemma-abc") → training progress...

Every mutating tool requires confirmed=True to take effect; without it, the call only returns a preview, so agents always preview before submitting.
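
The preview-then-confirm pattern can be sketched in a few lines of Python. This is a minimal illustration, not the server's code: `fine_tune_sketch` and the shape of its return value are hypothetical, and only the `confirmed` flag and the argument names come from the example above.

```python
from typing import Any


def fine_tune_sketch(model: str, dataset: str, runtime: str,
                     confirmed: bool = False) -> dict[str, Any]:
    """Hypothetical stand-in for a mutating tool; not the server's code."""
    config = {"model": model, "dataset": dataset, "runtime": runtime}
    if not confirmed:
        # First call: return the planned config only, create nothing.
        return {"status": "preview", "config": config}
    # Second call: the caller has seen the preview and opted in.
    return {"status": "submitted", "config": config}


preview = fine_tune_sketch(
    model="hf://google/gemma-2b",
    dataset="hf://tatsu-lab/alpaca",
    runtime="torchtune-gemma-2b",
)
submitted = fine_tune_sketch(**preview["config"], confirmed=True)
print(preview["status"], submitted["status"])
```

Defaulting `confirmed` to `False` means a caller that forgets the flag gets a preview rather than a created job.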

MCP Client Config

<details> <summary>Cursor</summary>

Add to .cursor/mcp.json (or use the .mcp.json at the repo root for local dev):

{
  "mcpServers": {
    "kubeflow": {
      "command": "uv",
      "args": ["run", "kubeflow-mcp", "serve"]
    }
  }
}

</details>
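
If a client config file already lists other servers, the kubeflow entry can be merged in rather than pasted by hand. The following is a sketch under assumptions: the helper name `add_kubeflow_server` is hypothetical, and the demo writes to a temporary directory instead of a real project.

```python
import json
import tempfile
from pathlib import Path


def add_kubeflow_server(config_path: Path) -> dict:
    """Merge the kubeflow entry into an MCP client config, keeping other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    # Same entry as the Cursor example above.
    servers["kubeflow"] = {"command": "uv", "args": ["run", "kubeflow-mcp", "serve"]}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".cursor" / "mcp.json"
    merged = add_kubeflow_server(path)
    written = json.loads(path.read_text())
```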

<details> <summary>Claude Code</summary>

claude mcp add kubeflow -- kubeflow-mcp serve

</details>

Tools

23 tools organized by workflow phase:

| Phase | Tools | Description |
|---|---|---|
| Planning | pre_flight, check_compatibility, get_cluster_resources, estimate_resources | Environment validation and resource estimation |
| Discovery | list_training_jobs, get_training_job, list_runtimes, get_runtime | Browse jobs and available runtimes |
| Training | fine_tune, run_custom_training, run_container_training | Submit LoRA/QLoRA fine-tuning, custom scripts, or container jobs |
| Monitoring | get_training_logs, get_training_events, wait_for_training | Track progress, debug failures |
| Lifecycle | delete_training_job, update_training_job | Manage existing jobs (ownership-guarded) |
| Platform | inspect_crd, inspect_controller, patch_runtime, create_runtime, delete_runtime | Cluster inspection and runtime management |
| Health | health_check, get_server_logs | Server diagnostics |
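
For scripting against the catalog, the table above can be written as a plain mapping. The tool and phase names are copied from the table; the `phase_of` helper is a hypothetical convenience, not part of the server.

```python
from typing import Optional

# Phase -> tools, transcribed from the table above.
TOOLS_BY_PHASE: dict[str, list[str]] = {
    "Planning": ["pre_flight", "check_compatibility",
                 "get_cluster_resources", "estimate_resources"],
    "Discovery": ["list_training_jobs", "get_training_job",
                  "list_runtimes", "get_runtime"],
    "Training": ["fine_tune", "run_custom_training", "run_container_training"],
    "Monitoring": ["get_training_logs", "get_training_events", "wait_for_training"],
    "Lifecycle": ["delete_training_job", "update_training_job"],
    "Platform": ["inspect_crd", "inspect_controller", "patch_runtime",
                 "create_runtime", "delete_runtime"],
    "Health": ["health_check", "get_server_logs"],
}


def phase_of(tool: str) -> Optional[str]:
    """Return the phase a tool belongs to, or None if it is not listed."""
    for phase, tools in TOOLS_BY_PHASE.items():
        if tool in tools:
            return phase
    return None


total_tools = sum(len(tools) for tools in TOOLS_BY_PHASE.values())
print(total_tools, phase_of("fine_tune"))
```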

Requirements

| MCP Server | Kubeflow Trainer | Kubeflow SDK | Python | Kubernetes |
|---|---|---|---|---|
| 0.1.x | >= 2.2.0 | >= 0.4.0 | 3.10 - 3.12 | >= 1.27 |
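
The Python range in the table can be checked before installing. This is a small sketch; the function name `python_supported` is hypothetical, and it covers only the interpreter version, not the Trainer, SDK, or Kubernetes requirements.

```python
import sys

# Supported interpreter range for MCP Server 0.1.x, from the table above.
SUPPORTED_MIN = (3, 10)
SUPPORTED_MAX = (3, 12)


def python_supported(version: tuple = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    return SUPPORTED_MIN <= tuple(version) <= SUPPORTED_MAX


if not python_supported():
    print("Python %d.%d is outside the supported 3.10 - 3.12 range" % sys.version_info[:2])
```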

CLI Reference

kubeflow-mcp serve

# Options:
#   --clients           modules: trainer, optimizer (stub), hub (stub)
#   --persona           readonly | data-scientist | ml-engineer | platform-admin
#   --mode              full | progressive | semantic
#   --instruction-tier  full | compact | minimal
#   --transport         stdio | http | sse
#   --auth-token        bearer token for HTTP auth (dev/staging)
#   --log-level         DEBUG | INFO | WARNING | ERROR
#   --log-format        console | json (auto-detected if omitted)
#   --no-banner         suppress startup banner
kubeflow-mcp serve \
  --clients trainer \
  --persona ml-engineer \
  --mode full \
  --instruction-tier full \
  --transport stdio \
  --auth-token SECRET \
  --log-level INFO \
  --log-format console \
  --no-banner

--mode progressive exposes 3 meta-tools (~85 tokens) for hierarchical discovery. --mode semantic exposes 2 meta-tools (~69 tokens) using embedding search. Both reduce token consumption significantly for agent workflows.
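
To illustrate why hierarchical discovery saves tokens, here is a toy sketch in which an agent sees three small meta-tools instead of the whole catalog. Everything in it is an assumption made for illustration: the meta-tool names (`list_categories`, `list_tools`, `call_tool`), the registry layout, and the stub return values are hypothetical and not taken from the server.

```python
from typing import Callable

# Hypothetical registry: category -> tool name -> stub callable.
REGISTRY: dict[str, dict[str, Callable[..., str]]] = {
    "monitoring": {
        "get_training_logs": lambda name: f"logs for {name}",
        "get_training_events": lambda name: f"events for {name}",
    },
    "discovery": {
        "list_runtimes": lambda: "torchtune-llama, torchtune-gemma",
    },
}


def list_categories() -> list[str]:
    """Meta-tool 1 (hypothetical): top-level categories only."""
    return sorted(REGISTRY)


def list_tools(category: str) -> list[str]:
    """Meta-tool 2 (hypothetical): tool names within one category."""
    return sorted(REGISTRY[category])


def call_tool(category: str, tool: str, *args: object) -> str:
    """Meta-tool 3 (hypothetical): dispatch to the selected tool."""
    return REGISTRY[category][tool](*args)


print(list_categories())
print(call_tool("monitoring", "get_training_logs", "train-gemma-abc"))
```

The agent's context holds only the three meta-tool descriptions up front, and per-tool details are fetched on demand.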

<details> <summary>HTTP Authentication</summary>

When using --transport http, configure auth to secure the endpoint:

# Static bearer token (dev/staging)
kubeflow-mcp serve --transport http --auth-token my-secret-token

# Or via env var
export KUBEFLOW_MCP_AUTH_TOKEN=my-secret-token
kubeflow-mcp serve --transport http

# JWT verification (production)
export KUBEFLOW_MCP_JWKS_URI=https://auth.example.com/.well-known/jwks.json
export KUBEFLOW_MCP_JWT_ISSUER=https://auth.example.com
export KUBEFLOW_MCP_JWT_AUDIENCE=kubeflow-mcp
kubeflow-mcp serve --transport http

Without auth configured, the server logs a warning that the HTTP endpoint is open.

</details>
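
For readers unfamiliar with static bearer tokens, the check behind an option like --auth-token typically looks like the sketch below. It is not the server's implementation: the function name `is_authorized` is hypothetical, and the only technique shown is a constant-time comparison of the presented token against the expected one.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True if the header carries the expected bearer token."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison, so timing does not reveal how much matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


print(is_authorized("Bearer my-secret-token", "my-secret-token"))
```

A plain `==` would also work functionally, but `hmac.compare_digest` is the standard-library choice for comparing secrets.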

<details> <summary>Agent Subcommand</summary>

# Options:
#   --backend   ollama (default; more backends planned)
#   --model     model name for the backend
#   --mode      full | progressive | semantic
#   --thinking  enable thinking output (supported models)
kubeflow-mcp agent \
  --backend ollama \
  --model qwen3:8b \
  --mode full \
  --thinking

</details>

Development

make install-dev                  # setup environment
make verify                       # lint + format check
make test-python                  # run tests
make inspector                    # launch MCP Inspector (stdio)
make inspector TRANSPORT=http     # Inspector + Streamable HTTP (start server separately)
make inspector TRANSPORT=sse      # Inspector + SSE (start server separately)

License

Apache License 2.0 — see LICENSE.
