# ppb-mcp
An MCP server that exposes Poor Paul's Benchmark GPU inference data — quantization × throughput × VRAM × concurrent users — as queryable tools to any LLM client.
Hosted instance: https://mcp.poorpaul.dev/ (streamable-http transport, no auth)
## What it does
Connect any MCP-aware client (Claude Desktop, Cline, Continue, etc.) to ask questions like:
- "What's the best quantization for a 32 GB GPU running Qwen3.5-9B with 8 concurrent users?"
- "Show me every model tested at Q4_K_M on the RTX 5090."
- "Will Llama-13B at Q5_K_M fit on a 24 GB GPU at 4 concurrent users?"
It exposes four tools backed by 30,000+ real benchmark rows:
| Tool | What it does |
|---|---|
| `list_tested_configs` | Lists every tested GPU, model, and quantization (call this first) |
| `query_ppb_results` | Filters raw benchmark rows by GPU / VRAM / model / quant / users / backend |
| `recommend_quantization` | Three-tier empirical-first recommendation engine (high / medium / low confidence) |
| `get_gpu_headroom` | Sanity-checks a (gpu, model, quant, users) configuration for VRAM headroom |
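You can also hit the hosted endpoint from your own code. A minimal sketch with the official `mcp` Python SDK — the SDK calls are standard, but treat the argument names and response shapes shown here as following the example session later in this README, not a guaranteed contract:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # Connect to the hosted instance over streamable-http (no auth).
    async with streamablehttp_client("https://mcp.poorpaul.dev/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Call this first to see what has actually been benchmarked.
            configs = await session.call_tool("list_tested_configs", {})
            print(configs.content)
            # Then ask for a recommendation (argument names follow the
            # example session below).
            rec = await session.call_tool(
                "recommend_quantization",
                {"gpu_vram_gb": 32, "concurrent_users": 8,
                 "model": "Qwen3.5-9B", "priority": "balance"},
            )
            print(rec.content)

asyncio.run(main())
```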
## Install

### 1) Use the hosted instance (zero setup)

Add to your MCP client config (Claude Desktop example, `~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "ppb": {
      "transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
    }
  }
}
```
### 2) pip install and run locally (stdio)
```bash
pip install ppb-mcp
MCP_TRANSPORT=stdio ppb-mcp
```
Claude Desktop config:
```json
{
  "mcpServers": {
    "ppb": {
      "command": "ppb-mcp",
      "env": { "MCP_TRANSPORT": "stdio" }
    }
  }
}
```
### 3) Docker
```bash
docker run --rm -p 9933:9933 \
  -e MCP_TRANSPORT=streamable-http \
  -v ppb-hf-cache:/data/huggingface \
  ghcr.io/paulplee/ppb-mcp:latest
```
### 4) From source
```bash
git clone https://github.com/paulplee/ppb-mcp
cd ppb-mcp
pip install -e ".[dev]"
ppb-mcp   # streamable-http on :9933
```
## Connect Your LLM Client
All clients use the same hosted endpoint: https://mcp.poorpaul.dev/mcp
### Claude Desktop

Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
```json
{
  "mcpServers": {
    "ppb": {
      "transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
    }
  }
}
```
Restart Claude Desktop after saving.
### Cursor

Edit `~/.cursor/mcp.json` (create it if it doesn't exist):
```json
{
  "mcpServers": {
    "ppb": {
      "url": "https://mcp.poorpaul.dev/mcp",
      "type": "http"
    }
  }
}
```
Or via UI: Settings → Tools & Integrations → MCP → Add Server.
### Windsurf

Edit `~/.codeium/windsurf/mcp_config.json`:
```json
{
  "mcpServers": {
    "ppb": {
      "serverUrl": "https://mcp.poorpaul.dev/mcp",
      "transport": "http"
    }
  }
}
```
### VS Code (GitHub Copilot Agent Mode)

Add to your `.vscode/mcp.json` (workspace) or user `settings.json`:
```json
{
  "mcp": {
    "servers": {
      "ppb": {
        "type": "http",
        "url": "https://mcp.poorpaul.dev/mcp"
      }
    }
  }
}
```
### Zed

Add to `~/.config/zed/settings.json` under `"context_servers"`:
```json
{
  "context_servers": {
    "ppb": {
      "command": {
        "path": "env",
        "args": ["MCP_TRANSPORT=stdio", "uvx", "ppb-mcp"]
      }
    }
  }
}
```
### Cline (VS Code extension)

Open the Cline panel → MCP Servers tab → Add Server → select SSE/HTTP → paste `https://mcp.poorpaul.dev/mcp`.
### Continue.dev

Add to `~/.continue/config.yaml`:
```yaml
mcpServers:
  - name: ppb
    transport:
      type: http
      url: https://mcp.poorpaul.dev/mcp
```
### OpenCode

Add to `~/.config/opencode/config.json`:
```json
{
  "mcp": {
    "ppb": {
      "type": "remote",
      "url": "https://mcp.poorpaul.dev/mcp"
    }
  }
}
```
### Goose (Block)

```bash
goose mcp add ppb --transport http --url https://mcp.poorpaul.dev/mcp
```
### Any stdio-compatible client

```bash
# Zero-install (requires uv):
env MCP_TRANSPORT=stdio uvx ppb-mcp

# After pip install:
env MCP_TRANSPORT=stdio ppb-mcp
```
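From your own code, the same stdio launch looks roughly like this with the `mcp` Python SDK (only the `ppb-mcp` command and the `MCP_TRANSPORT` env var come from this README; the rest is an illustrative sketch):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Spawn ppb-mcp as a stdio subprocess, the way a desktop client would.
    params = StdioServerParameters(command="ppb-mcp",
                                   env={"MCP_TRANSPORT": "stdio"})
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```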
Note on transport key names: MCP clients are not yet fully standardised on JSON key names for the HTTP transport. If your client doesn't connect with `"type": "http"`, try `"transport": "http"`, `"type": "sse"`, or `"transport": "streamable-http"`. The endpoint URL is the same regardless.
## Example session
```
> list_tested_configs
{ "gpus": ["Apple M4 Pro", "NVIDIA GB10", "NVIDIA GeForce RTX 5090"],
  "models": ["Qwen3.5-9B", ...], "quantizations": ["Q4_K_M", ...] }

> recommend_quantization(gpu_vram_gb=32, concurrent_users=8, model="Qwen3.5-9B", priority="balance")
{ "recommended_quantization": "Q5_K_M",
  "estimated_vram_usage_gb": 27.8,
  "estimated_tokens_per_second": 142.0,
  "headroom_gb": 4.2,
  "confidence": "high",
  "reasoning": "Q5_K_M is recommended for your NVIDIA GeForce RTX 5090 (32 GB) ...",
  "alternatives": ["Q4_K_M", "Q8_0"] }
```
## Configuration

| Env var | Default | Notes |
|---|---|---|
| `HF_DATASET` | `paulplee/ppb-results` | HuggingFace dataset ID |
| `REFRESH_INTERVAL_HOURS` | `1` | Background refresh cadence |
| `MCP_TRANSPORT` | `streamable-http` | `stdio` or `streamable-http` |
| `HOST` | `0.0.0.0` | HTTP bind host |
| `PORT` | `9933` | HTTP bind port |
| `LOG_LEVEL` | `INFO` | Python logging level |
## Self-hosting (Lightsail / any Ubuntu VPS)
```bash
git clone https://github.com/paulplee/ppb-mcp /tmp/ppb-mcp
cd /tmp/ppb-mcp
DOMAIN=mcp.example.com EMAIL=you@example.com ./deploy/deploy.sh
```
This installs Docker, builds the image, registers a systemd unit, configures nginx, and runs certbot.
## Development

```bash
pip install -e ".[dev]"
ruff check src tests
pytest -v
```
Integration tests against the live HuggingFace dataset are gated behind `PPB_RUN_INTEGRATION=1` to keep CI offline-clean; run `PPB_RUN_INTEGRATION=1 pytest -v` to include them.
## How recommendations work

- Tier 1 — empirical exact match (high confidence): ≥3 measured runs on a GPU at-or-below your VRAM budget at the requested concurrency.
- Tier 2 — empirical-near (medium): the same `(model, quant)` benchmarked on a different GPU at the same concurrency; throughput is borrowed, VRAM is scaled to your card.
- Tier 3 — formula extrapolation (low): `vram_per_user ≈ (params_B × bits_per_weight / 8) × 1.15`; viable iff total ≤ 90 % of your VRAM (see the sketch below).
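A minimal Python sketch of the Tier-3 check. The 1.15 overhead factor and the 90 % cutoff come from the formula above; the function name, the linear scaling with concurrent users, and ~5.5 bits/weight for Q5_K_M are my assumptions:

```python
def tier3_fits(params_b: float, bits_per_weight: float,
               concurrent_users: int, gpu_vram_gb: float) -> bool:
    """Tier-3 (low confidence): pure formula, no benchmark rows."""
    # Per-user footprint per the formula above, with a 15% overhead factor.
    vram_per_user = (params_b * bits_per_weight / 8) * 1.15
    total = vram_per_user * concurrent_users  # assumed linear scaling
    return total <= 0.9 * gpu_vram_gb         # viable iff total <= 90% of VRAM

# E.g. a 13B model at ~5.5 bits/weight, 4 concurrent users, 24 GB card:
print(tier3_fits(13, 5.5, 4, 24))  # False under this linear-scaling reading
```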
## License
MIT — see LICENSE.