slurm-mcp

An MCP server that gives AI coding assistants direct access to Slurm HPC clusters for job submission, file management, and shell access.

An MCP (Model Context Protocol) server that gives AI coding assistants like Claude Code direct access to Slurm HPC clusters.

The server runs on your cluster's login node and exposes Slurm operations, file management, and shell access as MCP tools — letting Claude submit jobs, monitor GPU availability, read logs, and manage files through natural conversation.

Features

  • Job Management — submit (sbatch), list (squeue), cancel (scancel), status (sacct), and tail output
  • Job Watcher — background polling that records terminal state in-process (inspect via list_watches)
  • Preamble Injection — prepend module loads / env setup into every inline job script
  • Auto-QOS — automatic --qos=hpgpu when targeting partitions that require it
  • File Operations — read, write, edit, search, and delete files with storage policy enforcement
  • Cluster Info — partition overview, node states, GPU availability
  • Shell Access — run arbitrary commands with safety guardrails
  • Git Sync — pull latest code to the cluster
  • Storage Policy — warns when data files (checkpoints, datasets, etc.) target quota-limited directories

Quick Start

1. Setup on the cluster

git clone https://github.com/dongwookim-ml/slurm-mcp.git
cd slurm-mcp
bash setup.sh

2. Configure Claude Code on your local machine

Add to ~/.claude.json:

{
  "mcpServers": {
    "slurm": {
      "command": "ssh",
      "args": ["user@cluster-host",
               "cd /path/to/slurm-mcp && .venv/bin/python server.py"]
    }
  }
}

Replace user@cluster-host and /path/to/slurm-mcp with your values. SSH key-based auth is required (no password prompts).
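
If ~/.claude.json already lists other servers, the entry has to be merged rather than pasted over the file. The following is a minimal sketch of such a merge; the helper name add_slurm_server is hypothetical and not part of slurm-mcp, and the host and path are the same placeholders as in the JSON above.

```python
import json
import shlex


def add_slurm_server(config: dict, ssh_target: str, remote_dir: str) -> dict:
    """Return a copy of `config` with a "slurm" entry under "mcpServers".

    Hypothetical helper: existing servers and unrelated keys are preserved.
    """
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers["slurm"] = {
        "command": "ssh",
        # shlex.quote protects directory names that contain spaces.
        "args": [ssh_target, f"cd {shlex.quote(remote_dir)} && .venv/bin/python server.py"],
    }
    updated["mcpServers"] = servers
    return updated


# Placeholder values; replace with your own before using the result.
merged = add_slurm_server({}, "user@cluster-host", "/path/to/slurm-mcp")
print(json.dumps(merged, indent=2))
```

The sketch only prints the merged configuration, so you can review it before writing anything to ~/.claude.json.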

3. Use it

Once configured, Claude Code can directly interact with your cluster:

  • "Submit a training job on 4 GPUs"
  • "Check my running jobs"
  • "Show me the last 100 lines of job 12345's output"
  • "What GPUs are available right now?"
  • "Find all .py files under my project directory"

Tools

| Category | Tools |
| --- | --- |
| Slurm Jobs | submit_job, list_jobs, cancel_job, job_status, tail_output |
| Watchers | watch_job, list_watches |
| File Ops | read_file, write_file, edit_file, search_files, delete_file, disk_usage |
| System | run_command, sync_code, cluster_info |

Configuration

This server targets the ai2 HPC cluster: partition names and QOS rules are hard-coded (see HPGPU_PARTITIONS in server.py and the QOS policy notes in CLAUDE.md). The paths below are configurable through environment variables, but the cluster-specific assumptions are not.

| Variable | Default | Description |
| --- | --- | --- |
| SLURM_MCP_HOME_DIR | /home1/$USER | Home directory (quota-limited) |
| SLURM_MCP_DATA_DIR | /home/$USER | Data storage directory |
| SLURM_MCP_SCRATCH_DIR | /scratch | Temporary staging area |
| SLURM_MCP_HOME_QUOTA_GB | 500 | Home quota threshold for warnings (GB) |
| SLURM_MCP_PREAMBLE | (empty) | Shell lines injected after the shebang into inline job scripts (e.g. module load cuda/12.1\nsource ~/.venv/bin/activate) |
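
To make the SLURM_MCP_PREAMBLE behavior concrete, here is a rough sketch of injecting lines after the shebang. It is not taken from server.py: the function name inject_preamble is hypothetical, and treating a literal \n in the variable as a line break is an assumption based on the example value above.

```python
def inject_preamble(script: str, preamble: str) -> str:
    """Insert `preamble` right after the shebang line of `script`.

    If the script has no shebang, the preamble goes at the very top.
    """
    if not preamble.strip():
        return script
    # Assumption: a literal backslash-n in the variable stands for a newline.
    block = preamble.replace("\\n", "\n").rstrip("\n") + "\n"
    if script.startswith("#!"):
        shebang, _, rest = script.partition("\n")
        return f"{shebang}\n{block}{rest}"
    return block + script


script = "#!/bin/bash\npython train.py\n"
preamble = "module load cuda/12.1\\nsource ~/.venv/bin/activate"
print(inject_preamble(script, preamble))
```

Keeping the shebang on the first line matters because sbatch uses it to choose the interpreter for the batch script.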

Auto-QOS

When submit_job targets a partition in {A100-40GB, A100-80GB, 4A100} and no --qos appears in extra_args, --qos=hpgpu is added automatically. Pass --qos=<other> in extra_args to override.
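
The rule can be sketched in a few lines. HPGPU_PARTITIONS is the name used in server.py, but the function below is an illustration only: apply_auto_qos is a hypothetical name, and treating extra_args as a list of strings is an assumption.

```python
HPGPU_PARTITIONS = {"A100-40GB", "A100-80GB", "4A100"}


def apply_auto_qos(partition: str, extra_args: list[str]) -> list[str]:
    """Return a copy of extra_args, adding --qos=hpgpu when the partition needs it."""
    # An explicit --qos (either "--qos=x" or "--qos x") always wins.
    has_qos = any(a == "--qos" or a.startswith("--qos=") for a in extra_args)
    if partition in HPGPU_PARTITIONS and not has_qos:
        return [*extra_args, "--qos=hpgpu"]
    return list(extra_args)


print(apply_auto_qos("A100-80GB", ["--gres=gpu:4"]))
print(apply_auto_qos("A100-80GB", ["--qos=normal"]))
```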

Watchers

watch_job <id> registers an async watcher that polls squeue (falling back to sacct) every 30 s (configurable). When the job reaches a terminal state (COMPLETED, FAILED, TIMEOUT, CANCELLED, OUT_OF_MEMORY, …) the final state and a summary are stored in the in-process watcher registry. Use list_watches to inspect. Watchers live in-process and are lost if the server restarts.
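
The polling loop described above might look roughly like the sketch below. It is a simplified illustration, not the code in server.py: the registry is a plain dict, only the terminal states named above are listed, and the state lookup is passed in as a function so the example runs without Slurm.

```python
import asyncio
from typing import Awaitable, Callable

# Only the states named in this README; the real list may be longer.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TIMEOUT", "CANCELLED", "OUT_OF_MEMORY"}

# In-process registry: job id -> latest known state. Lost on restart.
watches: dict[str, str] = {}


async def watch_job(
    job_id: str,
    get_state: Callable[[str], Awaitable[str]],
    interval: float = 30.0,
) -> str:
    """Poll get_state until the job reaches a terminal state, then return it."""
    while True:
        state = await get_state(job_id)
        watches[job_id] = state
        if state in TERMINAL_STATES:
            return state
        await asyncio.sleep(interval)


async def demo() -> str:
    # Fake state source standing in for squeue/sacct.
    states = iter(["PENDING", "RUNNING", "COMPLETED"])

    async def fake_state(_job_id: str) -> str:
        return next(states)

    return await watch_job("12345", fake_state, interval=0.01)


print(asyncio.run(demo()))
```

In a long-running server the coroutine would typically be started with asyncio.create_task so that the tool call returns immediately while polling continues in the background.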

Set the variables from the Configuration table in your shell profile, or pass them inline when starting the server:

SLURM_MCP_HOME_DIR=/home/myuser SLURM_MCP_DATA_DIR=/data/myuser .venv/bin/python server.py
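
On the server side, reading these variables with the documented defaults could look like the following sketch. The Settings class and load_settings function are hypothetical names used for illustration; only the variable names and defaults come from the Configuration table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    home_dir: str
    data_dir: str
    scratch_dir: str
    home_quota_gb: float
    preamble: str


def load_settings(env: dict[str, str]) -> Settings:
    """Build settings from an environment mapping, applying the documented defaults."""
    user = env.get("USER", "")
    return Settings(
        home_dir=env.get("SLURM_MCP_HOME_DIR", f"/home1/{user}"),
        data_dir=env.get("SLURM_MCP_DATA_DIR", f"/home/{user}"),
        scratch_dir=env.get("SLURM_MCP_SCRATCH_DIR", "/scratch"),
        home_quota_gb=float(env.get("SLURM_MCP_HOME_QUOTA_GB", "500")),
        preamble=env.get("SLURM_MCP_PREAMBLE", ""),
    )


settings = load_settings(dict(os.environ))
print(settings.home_dir, settings.data_dir)
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the defaults easy to check in isolation.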

Requirements

  • Python 3.10+
  • Slurm cluster with CLI tools (sbatch, squeue, sacct, sinfo, scancel)
  • SSH key-based access to the cluster
  • mcp Python package (installed automatically by setup.sh)

How It Works

The server is a single Python file (server.py) using the FastMCP framework. It runs on the cluster login node and wraps Slurm CLI commands as async MCP tools. Claude Code connects to it over SSH using the stdio transport.
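
As an illustration of wrapping a CLI command as an async function, here is a minimal sketch using only the standard library. The helper run_cli and the list_jobs signature are assumptions, not the actual code in server.py. With the mcp package installed, a function like list_jobs would be registered on a FastMCP instance (from mcp.server.fastmcp import FastMCP) with the @mcp.tool() decorator and served with mcp.run(transport="stdio").

```python
import asyncio
import sys


async def run_cli(*argv: str, timeout: float = 30.0) -> tuple[int, str, str]:
    """Run a command without blocking the event loop.

    Returns (exit code, stdout, stderr). The process is killed on timeout.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, out.decode(), err.decode()


async def list_jobs(user: str) -> str:
    """Hypothetical tool body: the user's queue as reported by squeue."""
    code, out, err = await run_cli("squeue", "-u", user)
    return out if code == 0 else f"squeue failed: {err.strip()}"


# Demo with the Python interpreter itself, so it runs without Slurm.
print(asyncio.run(run_cli(sys.executable, "-c", "print('ok')")))
```

Passing the arguments as a list to create_subprocess_exec, rather than building a shell string, avoids quoting problems when values come from a conversation.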

Storage policy enforcement is built in: when you write a file, the server checks whether it is a data file (model checkpoint, dataset, archive, etc.) that targets the quota-limited home directory and, if so, suggests the data directory instead.
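
A check of this kind could be sketched as follows. The suffix list and the function name storage_warning are assumptions made for illustration; the real server may classify data files differently.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed set of "data file" suffixes; not taken from server.py.
DATA_SUFFIXES = {".pt", ".ckpt", ".safetensors", ".npz", ".h5", ".parquet", ".tar", ".zip", ".gz"}


def storage_warning(path: str, home_dir: str, data_dir: str) -> Optional[str]:
    """Return a warning if a data file is about to be written under home_dir."""
    target = PurePosixPath(path)
    if target.suffix.lower() not in DATA_SUFFIXES:
        return None
    # Compare path components, so /home1/alice2 does not match /home1/alice.
    if not target.is_relative_to(PurePosixPath(home_dir)):
        return None
    return (
        f"{target.name} looks like a data file; consider writing it under "
        f"{data_dir} instead of {home_dir}."
    )


print(storage_warning("/home1/alice/run/model.ckpt", "/home1/alice", "/home/alice"))
```

The function returns a message instead of raising, which matches the warn-and-suggest behavior described above.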

License

MIT
