mcp-eval-harness

MCP-based code evaluation harness — sandboxed execution + LLM quality scoring.

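An MCP server that exposes four evaluation tools. The server contains no agent logic of its own; it is purely a tool provider (score_answer and detect_hallucination call an LLM API internally, but the server never decides what to evaluate). Any MCP-compatible client can connect and use these tools: the included AI agent, Cursor, Claude Desktop, or anything else that speaks MCP.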

Tools

| Tool | Description |
| --- | --- |
| run_code | Execute code in a sandboxed Docker container (or local fallback). Returns stdout, stderr, exit code. |
| run_tests | Run code against test cases. Returns pass/fail for each test with actual vs expected output. |
| score_answer | LLM-based quality scoring (0-10) against a rubric with breakdown by correctness, efficiency, readability, edge cases. |
| detect_hallucination | Check if text contains claims not supported by provided context. Returns confidence and unsupported claims. |

Architecture

┌─────────────────────┐        MCP (stdio)       ┌─────────────────────┐
│                     │                          │                     │
│   MCP Client        │        tool calls        │   MCP Server        │
│   (AI agent)        │  ─────────────────────►  │   (tool provider)   │
│                     │                          │                     │
│   Has the brain:    │         results          │   No AI here:       │
│   decides what to   │  ◄─────────────────────  │   just runs code,   │
│   evaluate & when   │                          │   executes tests,   │
│                     │                          │   calls scoring APIs│
└─────────────────────┘                          └─────────────────────┘

Server = pure tool provider. Exposes run_code, run_tests, score_answer, detect_hallucination over MCP stdio. No decision-making, no orchestration.

Client = the brain. An AI agent (GPT-4o-mini) that decides which tools to call, in what order, and synthesizes a final evaluation report.

Transport = stdio. The client spawns the server as a subprocess and communicates via stdin/stdout. This is how MCP stdio works — same as how Cursor or Claude Desktop talk to MCP servers.
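
To make the stdio transport concrete, here is a minimal sketch of how a tool call is framed on the wire: one JSON-RPC 2.0 message per line, written to the server's stdin. The tools/call method and the message envelope follow the MCP specification; the language and code argument names are assumptions, since the tool's input schema is not shown in this README.

```typescript
// Shape of the JSON-RPC 2.0 request MCP uses to invoke a tool.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build a tools/call request for the named tool.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport is newline-delimited, so a message must not contain
// raw newlines. JSON.stringify escapes newlines inside string values.
function frameForStdio(message: ToolCallRequest): string {
  return JSON.stringify(message) + "\n";
}

// Hypothetical arguments: the real run_code input schema may differ.
const request = buildToolCall(1, "run_code", {
  language: "javascript",
  code: "console.log(1 + 1);\n",
});
const line = frameForStdio(request);
console.log(line.trimEnd());
```

In practice the MCP SDK's client and stdio transport classes handle this framing, so client code rarely builds these messages by hand.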

Use with Any MCP Client

The server isn't locked to our AI agent. You can use it with:

Cursor — Add to your MCP config:

{
  "mcpServers": {
    "eval-harness": {
      "command": "npx",
      "args": ["tsx", "server/index.ts"],
      "cwd": "/path/to/mcp-eval-harness"
    }
  }
}

Claude Desktop — Add to claude_desktop_config.json:

{
  "mcpServers": {
    "eval-harness": {
      "command": "npx",
      "args": ["tsx", "server/index.ts"],
      "cwd": "/path/to/mcp-eval-harness"
    }
  }
}

MCP Inspector (for debugging):

npx @modelcontextprotocol/inspector npx tsx server/index.ts

Note: Yes, this uses Docker, not Firecracker microVMs. It's a demo, not a bank. In prod, use E2B or self-hosted Firecracker.

Stack

| Package | Purpose |
| --- | --- |
| @modelcontextprotocol/sdk | MCP server + client protocol |
| @ai-sdk/mcp | Vercel AI SDK MCP client adapter |
| ai | Vercel AI SDK core: generateText, generateObject |
| @ai-sdk/openai | OpenAI provider (direct API key, NOT Vercel Gateway) |
| zod | Schema validation for tool inputs and structured outputs |
| dotenv | Environment variables |

Setup

git clone https://github.com/salmankhan-prs/mcp-eval-harness.git
cd mcp-eval-harness
pnpm install
cp .env.example .env
# Add your OpenAI API key to .env

Pull Docker images for sandboxed execution (optional — falls back to local if Docker isn't running):

docker pull node:22-alpine
docker pull python:3.12-alpine

Usage

pnpm eval examples/problems.json

Example Output

Evaluating: Two Sum
Language: javascript
============================================================

Test Results: 3/3 passed

Quality Score: 8.5/10
  Correctness:  9/10
  Efficiency:   9/10
  Readability:  8/10
  Edge Cases:   8/10

Hallucination Check: No hallucinations detected

Verdict: PASS

Project Structure

mcp-eval-harness/
├── README.md
├── package.json
├── tsconfig.json
├── .env.example
├── .gitignore
├── server/
│   ├── index.ts                     # MCP server — registers tools, starts stdio transport
│   └── tools/
│       ├── run-code.ts              # Docker-sandboxed code execution (hacky but works)
│       ├── run-tests.ts             # Test runner — executes code against test cases
│       ├── score-answer.ts          # LLM-based quality scoring against rubric
│       └── detect-hallucination.ts  # LLM-based hallucination detection
├── client/
│   └── index.ts                     # AI agent that orchestrates evaluation via MCP
└── examples/
    └── problems.json                # Sample coding problems with test cases

How Each Tool Works

run_code — Detects if Docker is available. If yes, runs code in a container with --network=none, memory/CPU/PID limits. If Docker is unavailable, falls back to local child_process with a 10s timeout. In production you'd swap this for a Firecracker microVM pool.
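
As a rough illustration of the sandboxing described above, the sketch below assembles the argument list for docker run. The flags themselves (--network=none, --memory, --cpus, --pids-limit) are standard Docker options and the image names come from the Setup section; the limit values, the SandboxLimits type, and the buildDockerArgs helper are hypothetical, not taken from run-code.ts.

```typescript
// Hypothetical limits: the README names the kinds of limits, not their values.
interface SandboxLimits {
  memory: string;    // e.g. "128m"
  cpus: string;      // e.g. "0.5"
  pidsLimit: number; // maximum number of processes in the container
}

// Images listed in the Setup section.
const IMAGES = {
  javascript: "node:22-alpine",
  python: "python:3.12-alpine",
} as const;

type Language = keyof typeof IMAGES;

// Build the argv for `docker run`. The code is handed to the interpreter
// through its inline-eval flag (node -e / python -c).
function buildDockerArgs(language: Language, code: string, limits: SandboxLimits): string[] {
  const interpreter =
    language === "javascript" ? ["node", "-e", code] : ["python", "-c", code];
  return [
    "run",
    "--rm",                             // remove the container when it exits
    "--network=none",                   // no network access
    `--memory=${limits.memory}`,        // memory cap
    `--cpus=${limits.cpus}`,            // CPU cap
    `--pids-limit=${limits.pidsLimit}`, // guards against fork bombs
    IMAGES[language],
    ...interpreter,
  ];
}

const dockerArgs = buildDockerArgs("python", "print(1 + 1)", {
  memory: "128m",
  cpus: "0.5",
  pidsLimit: 64,
});
console.log(["docker", ...dockerArgs].join(" "));
```

The resulting array can be passed to Node's child_process.spawn("docker", dockerArgs). For the local fallback, the child_process functions accept a timeout option in milliseconds, which is one way to get the 10s limit mentioned above.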

run_tests — Wraps the candidate solution + each test input into a runnable script. Calls run_code for each test case and compares stdout to expected output. Convention: solution must define a solution() function.
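
The wrap-and-compare step can be sketched as follows for JavaScript solutions. Only the solution() convention comes from this README; the TestCase shape, passing inputs as an argument array, printing the result as JSON, and trimming whitespace before comparison are all assumptions made for illustration.

```typescript
// Hypothetical test case shape; the format of problems.json is not shown here.
interface TestCase {
  input: unknown[];  // arguments passed to solution()
  expected: string;  // expected stdout
}

// Append a call to solution() to the candidate source and print the
// return value as JSON so it can be compared as text.
function wrapJavaScript(solution: string, input: unknown[]): string {
  return `${solution}\nconsole.log(JSON.stringify(solution(...${JSON.stringify(input)})));\n`;
}

// Compare captured stdout with the expected output, ignoring
// leading and trailing whitespace.
function judge(stdout: string, expected: string) {
  const actual = stdout.trim();
  const want = expected.trim();
  return { passed: actual === want, actual, expected: want };
}

const candidate = `function solution(nums, target) {
  const seen = new Map();
  for (let i = 0; i < nums.length; i++) {
    const j = seen.get(target - nums[i]);
    if (j !== undefined) return [j, i];
    seen.set(nums[i], i);
  }
  return [];
}`;

const testCase: TestCase = { input: [[2, 7, 11, 15], 9], expected: "[0,1]" };
const script = wrapJavaScript(candidate, testCase.input);
console.log(script);

// `script` would be sent to run_code; here the stdout is simulated.
console.log(judge("[0,1]\n", testCase.expected));
```

Comparing text rather than parsed values keeps the runner language-agnostic, at the cost of being sensitive to output formatting such as spacing inside arrays.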

score_answer — Sends the problem + solution to GPT-4o-mini with a structured output schema (Zod). Returns an overall score and per-criterion breakdown. This is what ReasonCore's expert evaluators do, but automated.
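
For readers who want a feel for the structured result, the sketch below models the four criteria from the Tools table and checks that each score is on the 0-10 scale. The tool itself uses a Zod schema; plain TypeScript is used here only to keep the example dependency-free. The field names are hypothetical, and computing the overall score as the unweighted mean is an assumption: it matches the numbers in Example Output, but the README does not say how the overall score is derived.

```typescript
// Hypothetical field names for the per-criterion breakdown.
interface ScoreBreakdown {
  correctness: number;
  efficiency: number;
  readability: number;
  edgeCases: number;
}

// Reject anything outside the 0-10 scale before trusting a model's output.
function validateBreakdown(b: ScoreBreakdown): ScoreBreakdown {
  for (const [criterion, value] of Object.entries(b)) {
    if (!Number.isFinite(value) || value < 0 || value > 10) {
      throw new RangeError(`${criterion} must be between 0 and 10, got ${value}`);
    }
  }
  return b;
}

// Assumption: overall = unweighted mean of the four criteria,
// rounded to one decimal place.
function overallScore(b: ScoreBreakdown): number {
  const values = [b.correctness, b.efficiency, b.readability, b.edgeCases];
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return Math.round(mean * 10) / 10;
}

const sample = validateBreakdown({
  correctness: 9,
  efficiency: 9,
  readability: 8,
  edgeCases: 8,
});
console.log(overallScore(sample)); // 8.5
```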

detect_hallucination — Sends a claim + context to GPT-4o-mini. Returns whether unsupported claims exist, a confidence score, and specific unsupported claims.
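
Because the hallucination report comes from an LLM, a client may want to sanity-check it before acting on it. The following is a minimal sketch of such a check. The field names, the 0 to 1 confidence range, and the parseReport helper are hypothetical; the README only states that the tool returns a flag, a confidence score, and a list of unsupported claims.

```typescript
// Hypothetical result shape for detect_hallucination.
interface HallucinationReport {
  hasUnsupportedClaims: boolean;
  confidence: number;          // assumed to be in the range 0..1
  unsupportedClaims: string[];
}

// Parse a JSON string and verify it has the expected shape.
function parseReport(raw: string): HallucinationReport {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new TypeError("report must be a JSON object");
  }
  const { hasUnsupportedClaims, confidence, unsupportedClaims } =
    data as Record<string, unknown>;
  if (typeof hasUnsupportedClaims !== "boolean") {
    throw new TypeError("hasUnsupportedClaims must be a boolean");
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new RangeError("confidence must be a number between 0 and 1");
  }
  if (!Array.isArray(unsupportedClaims) || !unsupportedClaims.every((c) => typeof c === "string")) {
    throw new TypeError("unsupportedClaims must be an array of strings");
  }
  // A report that lists claims but says there are none is inconsistent.
  if (!hasUnsupportedClaims && unsupportedClaims.length > 0) {
    throw new Error("claims listed but hasUnsupportedClaims is false");
  }
  return { hasUnsupportedClaims, confidence, unsupportedClaims };
}

const report = parseReport(
  '{"hasUnsupportedClaims":true,"confidence":0.9,"unsupportedClaims":["The function runs in O(1) time"]}',
);
console.log(report.unsupportedClaims);
```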

Prerequisites

  • Node.js 22+
  • Docker Desktop running (optional — falls back to local execution)
  • OpenAI API key in .env
