ArmBench MCP Server
Enables benchmarking and inference of LLMs on Arm64 cloud instances with KleidiAI optimizations, providing an MCP-compatible API for serving results.
ArmBench: Arm64 LLM Inference Benchmark Suite + MCP Server
KleidiAI-optimized LLM benchmarking and inference server for Arm64 cloud infrastructure. Built for the Arm AI Optimization Challenge 2026.
What is ArmBench?
ArmBench is a one-command benchmarking tool that:
- Deploys LLMs (Llama 3.2) on Arm64 cloud instances using llama.cpp + KleidiAI
- Measures real performance: tokens/sec, time to first token, and memory usage across quantization levels (Q4_K_M vs Q8_0)
- Serves results via an MCP-compatible FastAPI server any agent framework can call
- Visualizes everything in a clean real-time dashboard
Architecture
armbench/
├── benchmark/     # llama.cpp + KleidiAI inference engine + metrics
├── mcp_server/    # FastAPI MCP-compatible LLM endpoint
├── dashboard/     # Real-time results dashboard (HTML)
├── scripts/       # One-command setup + benchmark + server scripts
└── docker/        # Arm64-optimized Docker configuration
Quick Start (Arm64 Instance)
1. Clone and setup
git clone https://github.com/sirmos/armbench.git
cd armbench
bash scripts/setup.sh
2. Run benchmark
bash scripts/run_benchmark.sh
3. Start MCP server
bash scripts/start_mcp.sh
4. Open dashboard
Navigate to http://your-instance-ip:8000 in your browser.
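Before running the setup script, it can help to confirm that the host really is an Arm64 machine. The following is a minimal sketch, not part of the ArmBench scripts; the helper name `is_arm64` is hypothetical, and it assumes the platform reports `aarch64` (typical on Linux) or `arm64` (typical on macOS).

```python
import platform

# Machine strings commonly reported for 64-bit Arm
# (assumption: "aarch64" on Linux, "arm64" on macOS).
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine: str) -> bool:
    """Return True if the machine string denotes a 64-bit Arm CPU."""
    return machine.strip().lower() in ARM64_NAMES


if __name__ == "__main__":
    machine = platform.machine()
    if is_arm64(machine):
        print(f"OK: running on {machine}")
    else:
        print(f"Warning: {machine} is not Arm64; Arm-specific optimizations will not apply")
```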
Tested Arm64 Platforms
| Platform | Instance | Arm CPU |
|---|---|---|
| Oracle Cloud | VM.Standard.A1.Flex | Ampere Altra |
| AWS | c7g.large | Graviton3 |
| GCP | c4a-standard-4 | Axion |
What We Benchmark
| Metric | Description |
|---|---|
| Tokens/sec | Inference throughput |
| Time to First Token | Latency from prompt to first output token |
| Memory (MB) | RAM consumed during inference |
| Model size (GB) | Disk footprint per quantization level |
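As an illustration of how the two timing metrics in the table relate to raw timestamps, here is a small sketch. It is not taken from the ArmBench code: the `RunTiming` structure and both function names are hypothetical, and throughput is defined here as end-to-end (all generated tokens divided by total elapsed time), which is only one of several reasonable conventions.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps (in seconds) for one inference run. Hypothetical structure."""
    request_start: float    # when the prompt was submitted
    first_token_at: float   # when the first output token arrived
    last_token_at: float    # when the last output token arrived
    tokens_generated: int   # number of output tokens


def time_to_first_token(t: RunTiming) -> float:
    """Latency from prompt submission to the first output token."""
    return t.first_token_at - t.request_start


def tokens_per_second(t: RunTiming) -> float:
    """End-to-end throughput: generated tokens over total elapsed time."""
    elapsed = t.last_token_at - t.request_start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return t.tokens_generated / elapsed


# Illustrative numbers only, not measured results.
run = RunTiming(request_start=0.0, first_token_at=0.25,
                last_token_at=4.25, tokens_generated=100)
print(f"TTFT: {time_to_first_token(run):.2f} s")
print(f"Throughput: {tokens_per_second(run):.1f} tokens/sec")
```

Excluding the prompt-processing phase (dividing by the time between the first and last token instead) would give a decode-only rate, which is usually higher.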
Models
| Model | Quant | Size | Use case |
|---|---|---|---|
| Llama-3.2-3B-Instruct | Q4_K_M | 1.9 GB | Speed-optimized |
| Llama-3.2-3B-Instruct | Q8_0 | 3.4 GB | Quality-optimized |
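The file sizes above can be turned into a rough bits-per-weight figure, which makes the two quantization levels easier to compare. The sketch below assumes about 3.21 billion parameters for Llama 3.2 3B and treats 1 GB as 10^9 bytes; both are assumptions rather than values stated in this README, and the result ignores metadata and non-quantized tensors in the file.

```python
# Assumption: approximate parameter count for Llama 3.2 3B.
PARAMS = 3.21e9
# Assumption: sizes in the table use 1 GB = 10**9 bytes.
BYTES_PER_GB = 1e9


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average number of bits stored per model parameter."""
    return size_gb * BYTES_PER_GB * 8 / params


for quant, size_gb in [("Q4_K_M", 1.9), ("Q8_0", 3.4)]:
    print(f"{quant}: ~{bits_per_weight(size_gb):.1f} bits per weight")
```

Under these assumptions, Q4_K_M lands a little under 5 bits per weight and Q8_0 a little above 8, consistent with the roughly 1.8x difference in disk footprint.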
MCP Server API
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Server info |
| /health | GET | Health + platform info |
| /models | GET | List available models |
| /generate | POST | Run inference |
| /benchmark | POST | Full benchmark suite |
| /mcp/tools | GET | MCP-compatible tools listing |
| /docs | GET | Interactive API docs |
Example: Generate
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "What is KleidiAI?", "model": "Llama-3.2-3B-Q4_K_M"}'
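The same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names are hypothetical, and it assumes the server replies with a JSON body, since the response schema is not documented here.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, model: str) -> urllib.request.Request:
    """Build a POST request for /generate with the same JSON body as the curl example."""
    payload = json.dumps({"prompt": prompt, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, model: str, timeout: float = 120.0) -> dict:
    """Send the request and decode the reply (assumption: the reply is JSON)."""
    request = build_generate_request(base_url, prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `generate("http://localhost:8000", "What is KleidiAI?", "Llama-3.2-3B-Q4_K_M")` would return the decoded reply as a dictionary.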
Arm-Specific Optimizations
- KleidiAI: Arm's optimized kernel library for ML workloads
- llama.cpp Arm SVE: Scalable Vector Extension support enabled at build time
- Native CPU tuning: -DLLAMA_NATIVE=ON compiles for the exact CPU microarchitecture
- Thread optimization: Automatically uses all available Arm cores
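The thread optimization in the last bullet could be approximated as follows. This is a sketch of one possible approach, not the ArmBench implementation: the function names are hypothetical, and it assumes the resulting count is passed to llama.cpp through its `--threads` option.

```python
import os


def pick_thread_count() -> int:
    """Number of CPU cores this process may use (at least 1)."""
    try:
        # On Linux this respects CPU affinity masks (e.g. set by taskset).
        return max(1, len(os.sched_getaffinity(0)))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total core count.
        return max(1, os.cpu_count() or 1)


def llama_thread_args(threads: int) -> list[str]:
    """Command-line arguments that set the llama.cpp thread count."""
    return ["--threads", str(threads)]


if __name__ == "__main__":
    print(llama_thread_args(pick_thread_count()))
```

Using the affinity mask rather than the raw core count avoids oversubscribing threads when the process is pinned to a subset of cores.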
License
MIT License; see LICENSE
Built for the Arm AI Optimization Challenge 2026