# nsys-mcp
<p align="center"> <img src="assets/nvidia-nsight-systems-icon-gbp-shaded-256.png" alt="Nsight Systems logo" width="128"> </p>
<h3 align="center">nsys MCP Server</h3>
<p align="center"> <code>MCP</code> · <code>GPU Profiling</code> · <code>NVIDIA Nsight Systems</code> · <code>LLM Agents</code> </p>
nsys-mcp is an MCP (Model Context Protocol) server that
provides GPU profiling capabilities through NVIDIA Nsight Systems (nsys).
It lets an LLM agent profile binaries, parse reports, compute statistics, and
analyze interval trees — all via standard MCP tool calls.
## Prerequisites

- Python 3.10+
- NVIDIA Nsight Systems (`nsys`) installed and available in `PATH`. Download it from the Nsight Systems page; see the Nsight Systems documentation for setup details.
## Installation

```bash
pip install -e .
```

For development (tests):

```bash
pip install -e ".[dev]"
```
## Running the Server

The server communicates over stdio (the default MCP transport):

```bash
python -m nsys_mcp.server
```
### Cursor / VS Code MCP configuration

Add to your MCP settings (e.g. `.cursor/mcp.json`):

```json
{
  "mcpServers": {
    "nsys-profiler": {
      "command": "python",
      "args": ["-m", "nsys_mcp.server"]
    }
  }
}
```
## Available Tools

The server exposes 10 tools:

| # | Tool | Description |
|---|---|---|
| 1 | `check_nsys` | Verify that `nsys` is installed and return its version |
| 2 | `profile_binary` | Profile a binary with full CUDA, NVTX, and GPU metrics collection |
| 3 | `load_report` | Load a pre-existing `.nsys-rep` or NDJSON `.json` file |
| 4 | `list_reports` | List all cached profiling reports with metadata |
| 5 | `get_event_summary` | Breakdown of event types and counts for a report |
| 6 | `get_kernel_stats` | Aggregate GPU kernel statistics grouped by kernel name |
| 7 | `get_nvtx_stats` | Aggregate NVTX range durations grouped by annotation text |
| 8 | `get_memcpy_stats` | Aggregate memory copy statistics grouped by direction |
| 9 | `build_interval_tree` | Construct an interval tree from profiling events |
| 10 | `query_interval_tree` | Run structural queries against an interval tree |
### profile_binary

Profile a binary with full CUDA, NVTX, and GPU metrics collection. Results are cached, so repeated calls with the same arguments skip re-profiling.

| Parameter | Type | Description |
|---|---|---|
| `binary` | `str` | Path to the executable |
| `args` | `list[str]` | Command-line arguments (optional) |
| `env` | `dict[str, str]` | Extra environment variables (optional) |
| `cwd` | `str` | Working directory (optional) |
| `duration` | `int` | Max profiling duration in seconds (optional) |
| `extra_nsys_flags` | `list[str]` | Additional `nsys` flags (optional) |

Returns `report_id`, `event_counts`, and `time_span_ns`.
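For illustration, a successful result might look like the following (all values are made up, and only the three field names above are documented; the exact payload shape may differ):

```json
{
  "report_id": "rpt_0001",
  "event_counts": { "kernel": 1250, "nvtx": 310, "memcpy": 96 },
  "time_span_ns": 4830021776
}
```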
### load_report

Load a pre-existing `.nsys-rep` or NDJSON `.json` file without re-profiling.

| Parameter | Type | Description |
|---|---|---|
| `path` | `str` | Path to a `.nsys-rep` or `.json` file |
### get_event_summary

Get a breakdown of event types and counts for a report.

| Parameter | Type | Description |
|---|---|---|
| `report_id` | `str` | ID from `profile_binary` or `load_report` |
### get_kernel_stats

Aggregate GPU kernel statistics grouped by kernel name. Includes duration statistics (mean, std, min, max, median, count, total) and GPU metrics (grid/block size, shared memory, registers).

| Parameter | Type | Description |
|---|---|---|
| `report_id` | `str` | Report identifier |
| `top_n` | `int` | Limit to the top N kernels (optional) |
| `sort_by` | `str` | `total_ns`, `count`, `mean_ns`, or `max_ns` (default: `total_ns`) |
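The duration statistics above amount to a group-by aggregation, which can be sketched as follows (a simplified model of the server's aggregator; the dict-based event shape is an assumption):

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def aggregate_kernels(events):
    """Group kernel events by name and compute duration statistics.

    Each event is assumed to be a dict with "name" and "duration_ns" keys.
    """
    groups = defaultdict(list)
    for ev in events:
        groups[ev["name"]].append(ev["duration_ns"])
    return {
        name: {
            "count": len(durs),
            "total_ns": sum(durs),
            "mean_ns": mean(durs),
            "std_ns": pstdev(durs) if len(durs) > 1 else 0.0,
            "min_ns": min(durs),
            "max_ns": max(durs),
            "median_ns": median(durs),
        }
        for name, durs in groups.items()
    }
```

Sorting the result by the chosen `sort_by` key and slicing then gives the `top_n` behavior.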
### get_nvtx_stats

Aggregate NVTX range durations grouped by annotation text.

| Parameter | Type | Description |
|---|---|---|
| `report_id` | `str` | Report identifier |
| `domain_id` | `int` | Filter by NVTX domain (optional) |
### get_memcpy_stats

Aggregate memory copy statistics grouped by copy direction (HtoD, DtoH, DtoD, etc.). Includes duration stats, total bytes, and bandwidth estimates.

| Parameter | Type | Description |
|---|---|---|
| `report_id` | `str` | Report identifier |
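A bandwidth estimate of this kind follows from total bytes moved over total copy time; the exact formula and unit choice (GiB/s here) are assumptions about the server's implementation:

```python
def bandwidth_gib_s(total_bytes: int, total_duration_ns: int) -> float:
    """Estimate average copy bandwidth in GiB/s.

    total_bytes: bytes moved across all copies in the group
    total_duration_ns: summed copy duration in nanoseconds
    """
    seconds = total_duration_ns / 1e9
    return total_bytes / seconds / (1024 ** 3)
```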
### build_interval_tree

Construct an interval tree from profiling events. If multiple disjoint trees exist (a forest), they can be merged under a synthetic root.

| Parameter | Type | Description |
|---|---|---|
| `report_id` | `str` | Report identifier |
| `event_types` | `list[str]` | Subset of `["kernel", "nvtx", "trace", "memcpy", "sync"]` (default: all) |
| `reduce_forest` | `bool` | Merge a forest into a single tree (default: `true`) |
| `thread_id` | `int` | Filter by thread/stream ID (optional) |
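The forest reduction can be sketched like this (the dict-based nodes and the synthetic root's name are illustrative, not the server's actual representation):

```python
def reduce_forest(roots):
    """Merge disjoint interval trees under a synthetic root.

    Each root is a dict: {"name", "start_ns", "end_ns", "children"}.
    A single tree is returned unchanged; a forest gets a new root
    spanning the earliest start to the latest end.
    """
    if len(roots) == 1:
        return roots[0]
    return {
        "name": "<forest-root>",
        "start_ns": min(r["start_ns"] for r in roots),
        "end_ns": max(r["end_ns"] for r in roots),
        "children": list(roots),
    }
```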
### query_interval_tree

Run structural queries against a previously built interval tree.

| Parameter | Type | Description |
|---|---|---|
| `report_id` | `str` | Report identifier |
| `query_type` | `str` | One of the query types below |
| `event_name` | `str` | Event name for `count_calls` |
| `subtree_root_name` | `str` | Scope the query to a named subtree (optional) |
| `max_depth` | `int` | Limit traversal depth (optional) |

Query types:

| Type | Description |
|---|---|
| `most_time_consuming` | Find the longest-duration event in a subtree |
| `top_level` | List top-level interval names |
| `count_calls` | Count occurrences of a named event in a subtree |
| `subtree_summary` | Aggregated stats for a named subtree |
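For instance, `count_calls` amounts to a recursive traversal, sketched here over dict-based nodes (an assumption about the tree representation):

```python
def count_calls(node, event_name):
    """Count occurrences of a named event in a subtree (depth-first)."""
    total = 1 if node["name"] == event_name else 0
    for child in node.get("children", []):
        total += count_calls(child, event_name)
    return total
```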
## Typical Workflow

1. `check_nsys()` — verify `nsys` is available
2. `profile_binary(binary="/app/solver", ...)` — profile and get a `report_id`
3. `get_kernel_stats(report_id, top_n=10)` — see the top 10 kernels
4. `get_nvtx_stats(report_id)` — see NVTX annotation timings
5. `get_memcpy_stats(report_id)` — see memory transfer stats
6. `build_interval_tree(report_id)` — build the tree
7. `query_interval_tree(report_id, query_type="most_time_consuming")` — find the bottleneck
8. `query_interval_tree(report_id, query_type="count_calls", event_name="cub::DeviceReduce")` — count calls to a specific kernel
## Caching

Profiling results are cached in two tiers:

- In-memory LRU — fast access for the current session (up to 8 reports).
- Disk — persists across server restarts at `~/.nsys_mcp/cache/`.
Cache keys are derived from the binary path and arguments, so identical profiling runs reuse cached results automatically.
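A cache key along those lines might be derived as follows (the hashed fields and hash choice are assumptions; nsys-mcp's actual scheme may differ):

```python
import hashlib
import json

def cache_key(binary: str, args=None, env=None) -> str:
    """Derive a stable cache key from a profiling request.

    Serializing with sorted keys makes the key deterministic, so
    identical requests map to the same cached report.
    """
    payload = json.dumps(
        {"binary": binary, "args": args or [], "env": env or {}},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```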
## Testing

```bash
pip install -e ".[dev]"
pytest
```
## Project Structure

```
src/nsys_mcp/
├── server.py          # FastMCP server, tool definitions, lifespan
├── nsys_runner.py     # nsys CLI wrapper (profile, export, version)
├── report_parser.py   # NDJSON streaming parser, string-table resolution
├── models.py          # Pydantic models for events, stats, configs
├── aggregator.py      # Group-by aggregation (mean, std, min, max, count)
├── interval_tree.py   # Interval tree/forest construction + queries
└── cache.py           # Two-tier cache (memory LRU + disk pickle)
```
## License
nsys-mcp is licensed under the MIT License.