MCP Servers

HumanJudge

Human-evaluation infrastructure for AI quality. 25,000+ blind human reviews by 200+ verified reviewers across 58 AI models — query the data via five MCP tools (get_model_scores, compare_models, get_flags, check_content, get_latest).

README

grandjury

Real human evaluations of AI models. 25,000+ blind reviews by 200+ verified reviewers across 58 models (GPT-5, Claude Opus 4.7, Gemini 3.1, Grok 4.3, DeepSeek V4, Mistral, Kimi K2.6 and more) and 44 benchmarks. Free. Python SDK + MCP server + ChatGPT GPT + REST.

Get human feedback on your AI in 3 lines of Python:

from grandjury import GrandJury

gj = GrandJury()  # reads GRANDJURY_API_KEY from env
gj.trace(name="chat", input=prompt, output=response, model="gpt-4o")

Then open your Jupyter notebook:

df = gj.results()  # traces with human votes — as a DataFrame
print(f"Pass rate: {df['pass_rate'].mean():.1%}")

Patent Pending.

Why HumanJudge

Most AI evaluation pipelines use LLMs to judge LLMs. That inherits the same biases, conventions, and blind spots as the models being evaluated — and tends to produce eval pipelines with ~0% disagreement, which is the diagnostic for "not measuring quality, just confirming assumptions" (essay).

HumanJudge uses real human reviewers who blind-evaluate AI outputs across structured benchmarks (marketing, healthcare, end-of-life conversations, cultural fluency, code review, and more) and write their reasoning. Reviewers earn XP, get credentialing letters, and stay anonymous to the reader by default.

The data is queryable via this SDK, the MCP server, a ChatGPT GPT action, and a REST API.

Integrations

Surface	Install	Docs
Python SDK	`pip install grandjury`	docs/pulse/python-sdk
Claude Desktop MCP	Add `https://api.humanjudge.com/mcp` as a custom connector	docs/pulse/claude-desktop
Claude Code MCP	Add to `.mcp.json` (remote, no install)	docs/pulse/claude-code
ChatGPT GPT	Search "HumanJudge" in the GPT Store	docs/pulse/chatgpt
REST API	n/a	humanjudge.com/docs

Use cases

ML engineers — benchmark your model against 58+ commercial models on real tasks; see exactly what humans flag with category + reasoning
Data scientists — pull reviewer reasoning, flag patterns, and disagreement data as pandas DataFrames for analysis
AI agent developers — log traces from any agent loop (decorator, context manager, or direct call); get reviewer feedback you can quote to stakeholders
Independent researchers — query the public benchmark data without an API key (read-only)
Builders — register your own AI, create a custom benchmark on the topics you care about, get real human reviews on YOUR specific use case (humanjudge.com/for-developers)

What is GrandJury?

HumanJudge connects your AI to a community of human reviewers who evaluate your model's outputs. GrandJury is the Python SDK — it sends traces and retrieves human evaluation results.

Write path: Log AI calls from your app → traces appear in your developer dashboard. Read path: Fetch evaluation results (votes, pass rates, reviewer feedback) into DataFrames for analysis.

Installation

pip install grandjury

Optional performance dependencies:

pip install grandjury[performance]  # msgspec, pyarrow, polars

Quick Start

1. Register your model

Go to humanjudge.com/projects/new, register your AI, and copy the secret key.

export GRANDJURY_API_KEY=gj_sk_live_...

2. Log traces from your app

from grandjury import GrandJury

gj = GrandJury()  # zero-config — reads from env

# Option A: Direct call
gj.trace(name="chat", input="What is ML?", output="Machine learning is...", model="gpt-4o")

# Option B: Decorator — auto-captures input/output/latency
@gj.observe(name="chat", model="gpt-4o")
def call_llm(prompt: str) -> str:
    return openai.chat(prompt)

# Option C: Context manager
with gj.span("chat", input=prompt) as s:
    response = call_llm(prompt)
    s.set_output(response)

3. Get human evaluation results

Once reviewers vote on your traces:

# Trace-level summary
df = gj.results()
# trace_id | input | output | model | pass_count | flag_count | total_votes | pass_rate

# Individual votes with reviewer identity
df_votes = gj.results(detail='votes')
# trace_id | voter_id | voter_name | verdict | flag_category | feedback | created_at

# Filter by benchmark
df_benchmark = gj.results(evaluation='marketing-benchmark')

# Export
df.to_parquet('evaluation_results.parquet')

4. Run analytics

Works on both live platform data and offline datasets:

# Auto-fetch from platform
gj.analytics.vote_histogram()
gj.analytics.population_confidence(voter_list=[...])

# Or pass your own data
import pandas as pd
df = pd.read_csv("my_votes.csv")
gj.analytics.vote_histogram(df)
gj.analytics.votes_distribution(df)

Enroll in Benchmarks

List and enroll your model in open benchmarks programmatically:

# Browse available benchmarks
benchmarks = gj.benchmarks.list()

# Enroll with endpoint config
gj.benchmarks.enroll(
    benchmark_id="...",
    model_id="...",
    endpoint_config={
        "endpoint": "https://api.myapp.com/v1/chat/completions",
        "apiKey": "sk-...",
        "request_template": '{"model":"gpt-4o","messages":[{"role":"user","content":"{{prompt}}"}]}',
        "response_path": "choices[0].message.content"
    }
)

Analytics Methods

All analytics methods work on both platform data (gj.results(detail='votes')) and offline data (pandas/polars/CSV/parquet):

Method	Description
`gj.analytics.evaluate_model()`	Decay-adjusted scoring
`gj.analytics.vote_histogram()`	Vote time distribution
`gj.analytics.vote_completeness()`	Completeness per voter
`gj.analytics.population_confidence()`	Confidence metrics
`gj.analytics.majority_good_votes()`	Threshold analysis
`gj.analytics.votes_distribution()`	Votes per inference

Privacy

gj.results() only returns traces with at least 1 human vote (privacy gate)
Zero-vote traces are invisible to the SDK — only visible on the web dashboard
Reviewer identity is public (consistent with platform's public profile/leaderboard model)

API Reference

gj = GrandJury(
    api_key=None,     # reads GRANDJURY_API_KEY from env if not provided
    base_url="https://grandjury-server.onrender.com",
    timeout=5.0,
)

# Write
gj.trace(name, input, output, model, latency_ms, metadata, gj_inference_id)
await gj.atrace(...)  # async version (requires httpx)
gj.observe(name, model, metadata)  # decorator
gj.span(name, input, model, metadata)  # context manager

# Read
gj.results(detail=None, evaluation=None)  # returns DataFrame or list[dict]

# Browse
gj.models.list()
gj.models.get(model_id)
gj.benchmarks.list()
gj.benchmarks.enroll(benchmark_id, model_id, endpoint_config)

# Analytics
gj.analytics.evaluate_model(...)
gj.analytics.vote_histogram(data=None, ...)
gj.analytics.vote_completeness(data=None, voter_list=None, ...)
gj.analytics.population_confidence(data=None, voter_list=None, ...)
gj.analytics.majority_good_votes(data=None, ...)
gj.analytics.votes_distribution(data=None, ...)

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

License

See LICENSE.

Recommended Servers

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official

Featured

TypeScript

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official

Featured

Local

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official

Featured

TypeScript

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official

Featured

Python

E2B

Using MCP to run code via e2b.

Official

Featured

Neon Database

MCP server for interacting with Neon Management API and databases

Official

Featured

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official

Featured

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official

Featured