LLM API Benchmark MCP Server

This project provides an MCP (Model Context Protocol) server designed to benchmark Large Language Model (LLM) APIs. It allows you to measure various performance metrics such as generation throughput, prompt throughput, and Time To First Token (TTFT) for your LLM endpoints.

Features

  • Comprehensive Benchmarking: Measure key performance indicators like generation throughput, prompt throughput, and Time To First Token (TTFT).
  • Flexible Deployment: Supports both a remote SSE server for quick trials and local deployment via stdio or SSE transport.
  • Customizable Benchmarks: Configure various parameters for your LLM API benchmarks, including concurrency levels, model names, token limits, and more.
  • Detailed Output: Provides structured JSON output with aggregate metrics and per-request distributions for in-depth analysis.
  • Easy Integration: Designed as an MCP server for seamless integration with MCP clients.

Table of Contents

  • Prerequisites
  • Quick Start
  • Usage Example
  • Benchmark Parameters
  • License

Prerequisites

This project uses uv for environment management. Please ensure you have uv installed on your system.

Install uv

Refer to the official uv documentation for installation methods.
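
For example, on Linux or macOS you can use the standalone installer, or install uv with pip; see the uv documentation for other platforms and the current recommended commands:

# Standalone installer (Linux/macOS)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via pip
pip install uv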

Quick Start

You can try this MCP server through a public remote SSE server for a quick trial, or deploy it locally using either stdio or SSE transport.

Demo (Recommended for Simple Trials)

If you wish to quickly try out the MCP server, a public SSE server is available. You can configure your MCP client to connect to it directly.

Note: While this project does not collect your API endpoint or API key, please be cautious about potential exposure risks when using remote services.

MCP Configuration:

{
  "mcpServers": {
    "llm-benchmark-sse": {
      "url": "https://llm-api-benchmark.pikoo.de/sse"
    }
  }
}

Local Deployment (stdio)

This is the recommended method for local deployment.

MCP Configuration:

{
  "mcpServers": {
    "llm-benchmark-stdio":{
      "command": "uvx",
      "args": [
        "--refresh",
        "--quiet",
        "llm-api-benchmark-mcp-server"
      ]
    }
  }
}
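
With this configuration, uvx downloads and runs the published llm-api-benchmark-mcp-server package in an ephemeral, uv-managed environment, so no manual installation step is needed; the --refresh flag keeps the cached package up to date.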

Local Deployment (sse)

Alternatively, you can deploy the server locally using SSE transport.

  1. Clone the repository and navigate into it:

    git clone https://github.com/Yoosu-L/llm-api-benchmark-mcp-server.git
    cd llm-api-benchmark-mcp-server
    
  2. Modify src/llm_api_benchmark_mcp_server/main.py: comment out the stdio transport section and uncomment the SSE transport section. You can also change mcp.settings.port to your desired port (a sketch of this change follows these steps).

  3. Build and Start the MCP Server (SSE):

    uv build
    uv tool install dist/llm_api_benchmark_mcp_server-0.1.3-py3-none-any.whl # path may vary
    llm-api-benchmark-mcp-server
    
  4. Configure MCP Client: Update your MCP client configuration to connect to your local SSE server.

    {
      "mcpServers": {
        "llm-benchmark-sse": {
          "url": "http://localhost:47564/sse"
        }
      }
    }
    

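A minimal sketch of the transport toggle described in step 2, assuming a FastMCP-based main.py; the actual file layout and names may differ slightly from this:

# src/llm_api_benchmark_mcp_server/main.py (sketch)

def main():
    # stdio transport (default) - comment this out when using SSE:
    # mcp.run(transport="stdio")

    # SSE transport - uncomment and set the port to match your client config:
    mcp.settings.port = 47564
    mcp.run(transport="sse")
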
Usage Example

Once the MCP server is configured and running, you can use it to perform LLM API benchmarks.

Example Prompt:

Please help me perform an LLM API benchmark on this address with concurrency levels of 1 and 2: https://my-llm-api-service.com/v1, sk-xxx

Example Output from MCP Tools:

{
  "model_name": "gemini-2.0-flash",
  "input_tokens": 32,
  "max_tokens": 512,
  "latency_ms": 2.46,
  "benchmark_results": [
    {
      "concurrency": 1,
      "generation_throughput_tokens_per_s": {
        "total": 135.74,
        "avg": 135.74,
        "distribution": {
          "max": 135.74,
          "p50": 135.74,
          "p10": 135.74,
          "p1": 135.74,
          "min": 135.74
        }
      },
      "prompt_throughput_tokens_per_s": {
        "total": 8.48,
        "avg": 8.48,
        "distribution": {
          "max": 8.48,
          "p50": 8.48,
          "p10": 8.48,
          "p1": 8.48,
          "min": 8.48
        }
      },
      "ttft_s": {
        "avg": 0.41,
        "distribution": {
          "min": 0.41,
          "p50": 0.41,
          "p90": 0.41,
          "p99": 0.41,
          "max": 0.41
        }
      }
    },
    {
      "concurrency": 2,
      "generation_throughput_tokens_per_s": {
        "total": 247.6,
        "avg": 123.8,
        "distribution": {
          "max": 124.07,
          "p50": 123.53,
          "p10": 123.53,
          "p1": 123.53,
          "min": 123.53
        }
      },
      "prompt_throughput_tokens_per_s": {
        "total": 15.52,
        "avg": 7.76,
        "distribution": {
          "max": 7.78,
          "p50": 7.74,
          "p10": 7.74,
          "p1": 7.74,
          "min": 7.74
        }
      },
      "ttft_s": {
        "avg": 0.68,
        "distribution": {
          "min": 0.43,
          "p50": 0.43,
          "p90": 0.94,
          "p99": 0.94,
          "max": 0.94
        }
      }
    }
  ],
  "output_log": []
}
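
In this output, the total throughput at each concurrency level is the sum across concurrent requests (for example, at concurrency 2 the generation total of 247.6 tokens/s is the per-request average of 123.8 tokens/s times two), while the distribution fields report per-request percentiles.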

Benchmark Parameters

The run_llm_benchmark MCP tool accepts the following parameters:

  • base_url (string, required): Base URL of the OpenAI API endpoint (e.g., http://localhost:8000/v1).
  • api_key (string, optional, default: sk-dummy): API key for authentication.
  • model (string, optional, default: ""): Model to be used for the requests. If not provided, the server will attempt to discover the first available model.
  • prompt (string, optional, default: "Write a long story, no less than 10,000 words, starting from a long, long time ago."): Prompt to be used for generating responses. If num_words is provided and greater than 0, random input will be generated instead of using this prompt.
  • num_words (integer, optional, default: 0, minimum: 0): Number of words for random input. If greater than 0 and the default prompt is used, random input will be generated.
  • concurrency (string, optional, default: "1"): Comma-separated list of concurrency levels (e.g., "1,2,4,8").
  • max_tokens (integer, optional, default: 512, minimum: 1): Maximum number of tokens to generate.
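
As an illustration, the benchmark from the usage example above corresponds roughly to the following run_llm_benchmark arguments (unspecified parameters shown at their defaults):

{
  "base_url": "https://my-llm-api-service.com/v1",
  "api_key": "sk-xxx",
  "model": "",
  "prompt": "Write a long story, no less than 10,000 words, starting from a long, long time ago.",
  "num_words": 0,
  "concurrency": "1,2",
  "max_tokens": 512
}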

License

This project is licensed under the MIT License.
