mcp-prompt-lab

A local MCP server that brings prompt evaluation directly into your AI coding environment. Define test cases, run prompts against multiple LLM providers, score outputs with deterministic and LLM-graded assertions, and track quality over time — through 4 consolidated tools, 3 resources, and 2 prompt templates.

The only MCP server that exposes general-purpose prompt evaluation as MCP tools. Every other eval tool in the ecosystem tests MCP servers from the outside. This one brings eval capabilities into the host — so you iterate on prompts without leaving your editor.

Design Philosophy
Architecture Overview
Quick Start
Configuration
Tools Reference
Resources
Prompt Templates
Assertions Reference
Provider Configuration
Walkthrough: End-to-End Prompt Engineering
Use Cases
Database
Engineering Decisions

Design Philosophy

Most MCP servers on GitHub are thin API wrappers: one endpoint becomes one tool, names are generic, there is no error recovery, and the README says "install and run." This project takes the opposite approach, applying production-grade patterns to a domain that matters — evaluation is the scarcest skill in AI engineering right now, and this server makes it accessible from any MCP-compatible host.

Tool Consolidation (4 tools, not 15+)

The official MCP filesystem server exposes 13 tools. Agents work best with 10-15 tools maximum — after that, tool selection accuracy degrades. This server consolidates all evaluation operations into 4 tools grouped by user intent, not by CRUD operation:

Tool	Intent	Actions
`eval_assert`	"Check this output right now"	Single-purpose, no actions
`eval_suite`	"Set up an evaluation"	9 actions (CRUD for prompts + datasets + generation)
`eval_run`	"Run an evaluation"	Single-purpose, matrix execution
`eval_analyze`	"Look at results"	5 actions (get, compare, list, delete, trends)

Each tool uses action dispatch internally. The agent sees 4 clean entry points instead of 15+ individual tools competing for selection.
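As a rough illustration of action dispatch, the sketch below routes one tool's input to a handler based on its action field. The input type, the handler bodies, and the name dispatchAnalyze are hypothetical and not taken from the project source; only the five action names come from the table above.

```typescript
// Hypothetical input shape for eval_analyze: one tool, several actions.
type AnalyzeInput =
  | { action: "get_run"; run_id: number }
  | { action: "compare_runs"; run_ids: number[] }
  | { action: "list_runs" }
  | { action: "delete_run"; run_id: number }
  | { action: "trends"; prompt_name?: string };

// Route a single tool call to the matching handler based on `action`.
// The returned strings stand in for real handler results.
function dispatchAnalyze(input: AnalyzeInput): string {
  switch (input.action) {
    case "get_run":
      return `details for run ${input.run_id}`;
    case "compare_runs":
      return `comparison of runs ${input.run_ids.join(", ")}`;
    case "list_runs":
      return "recent runs";
    case "delete_run":
      return `deleted run ${input.run_id}`;
    case "trends":
      return `trends for ${input.prompt_name ?? "all prompts"}`;
  }
}
```

A discriminated union like this lets the compiler check that each action only reads the fields it requires.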

Dynamic Hints

Every response includes hints — not just on errors, but on success too. Five rules:

Errors say what happened AND what to do next. "Prompt 'x' exists (v2). To update, include checksum: 'abc123...'" — the fix is in the error message.
Missing settings get flagged. LLM-graded assertions without grader_provider → the hint tells you exactly which parameter to add.
Success responses suggest the logical follow-up. Save a prompt → "Use eval_run with prompt_name 'x' to run evaluations."
Wrong values suggest available options. Invalid provider string → lists all supported providers.
Auto-corrections get reported. Variables auto-extracted from {{var}} patterns → response confirms what was detected.

Token-Aware Responses

An eval with 50 cases x 3 providers = 150 results. Dumping all into context would blow the token budget. Response strategy:

eval_run returns a summary (pass rates, avg scores, latency, token usage) + top 3 failures with reasons
Full per-case breakdowns live behind eval_analyze get_run with limit/offset pagination
Long outputs are truncated to a token budget before embedding in failure reports

This makes eval results usable even in contexts with aggressive token limits.
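A minimal sketch of the response strategy above, assuming a rough heuristic of about 4 characters per token. The function names, the Failure shape, and the default budgets are illustrative assumptions; the server's actual token estimation may differ.

```typescript
// Assumption: roughly 4 characters per token. Real tokenizers vary.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Cut text to fit a token budget and mark the cut with "...".
function truncateToBudget(text: string, maxTokens: number): string {
  if (estimateTokens(text) <= maxTokens) return text;
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  return text.slice(0, Math.max(0, maxChars - 3)) + "...";
}

// Hypothetical failure record for the summary.
interface Failure {
  caseIndex: number;
  output: string;
  reason: string;
}

// Keep only the first `limit` failures, each with a truncated output.
function topFailures(failures: Failure[], limit = 3, tokensPerOutput = 50): Failure[] {
  return failures
    .slice(0, limit)
    .map((f) => ({ ...f, output: truncateToBudget(f.output, tokensPerOutput) }));
}
```

The summary stays bounded no matter how many cases fail, and the full outputs remain reachable through pagination.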

Checksum Pattern for Safe Mutations

Updating a saved prompt requires the current checksum (SHA-256 of the content). This prevents overwriting a prompt that was changed in another session or by another tool call — a real concern when multiple agents share the same MCP server. The error response always includes the current checksum, so recovery is one copy-paste away.
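The checksum guard can be sketched as follows. This version uses createHash from node:crypto rather than Bun.CryptoHasher so it runs in either runtime; the StoredPrompt shape and the updatePrompt function are hypothetical, and the hint text mirrors the error message quoted earlier.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the prompt content, hex encoded.
function checksum(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Hypothetical in-memory record of a saved prompt.
interface StoredPrompt {
  name: string;
  content: string;
  version: number;
}

type UpdateResult =
  | { status: "success"; version: number; checksum: string }
  | { status: "error"; hint: string };

// Reject the update unless the caller supplies the checksum of the stored content.
function updatePrompt(stored: StoredPrompt, newContent: string, provided?: string): UpdateResult {
  const current = checksum(stored.content);
  if (provided !== current) {
    return {
      status: "error",
      hint: `Prompt '${stored.name}' exists (v${stored.version}). To update, include checksum: '${current}'`,
    };
  }
  stored.content = newContent;
  stored.version += 1;
  return { status: "success", version: stored.version, checksum: checksum(newContent) };
}
```

Because the error carries the current checksum, a caller that really does want to overwrite can retry with one extra field.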

Architecture Overview

src/
  index.ts              STDIO transport entry point
  server.ts             MCP server setup: tools, resources, prompts
  tools/
    eval-assert.ts      Standalone assertion runner
    eval-suite.ts       Prompt & dataset CRUD + synthetic generation
    eval-run.ts         Evaluation execution orchestrator
    eval-analyze.ts     Run analysis, comparison, trends
  engine/
    assertions.ts       8 deterministic + 4 LLM-graded assertion types
    providers.ts        Unified provider resolution (5 hosted providers + local)
    runner.ts           Concurrency-controlled matrix execution
  db/
    database.ts         SQLite (bun:sqlite) with WAL, migrations, lazy init
  utils/
    checksum.ts         SHA-256 via Bun.CryptoHasher
    hints.ts            Consistent response envelope formatting
    tokens.ts           Token estimation + truncation for context budget

Key architectural choices:

Bun-native SQLite (bun:sqlite) — zero dependencies for persistence, WAL mode for safe concurrent reads
Vercel AI SDK — thin provider abstraction with unified generateText/generateObject across all providers. No orchestration frameworks (no LangChain, no CrewAI)
Zod v4 — shared schema validation between MCP tool inputs and LLM grader structured outputs
STDIO-only transport — API keys stay in your local environment, eval history stays in local SQLite. Zero infrastructure to deploy.

Quick Start

Prerequisites: Bun installed.

git clone https://github.com/nicholasbarwicki/mcp-prompt-lab
cd mcp-prompt-lab
bun install

Verify the server starts:

bun run src/index.ts

To inspect with MCP Inspector: use STDIO transport, command bun, args ["run", "/absolute/path/to/src/index.ts"].

Configuration

Claude Code

Create .mcp.json in the project root (or add to ~/.claude.json globally):

{
  "mcpServers": {
    "prompt-lab": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "GOOGLE_GENERATIVE_AI_API_KEY": "${GOOGLE_GENERATIVE_AI_API_KEY}"
      }
    }
  }
}

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "prompt-lab": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_GENERATIVE_AI_API_KEY": "AIza..."
      }
    }
  }
}

Cursor / Windsurf / Any MCP Host

Same pattern — STDIO transport, bun run command, env vars for the providers you use.

Only include API keys for providers you intend to use. Deterministic assertions require no keys at all.

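Bun auto-loads a .env file from the directory it is launched in, so a line such as OPENAI_API_KEY=sk-... in .env at the project root works when the server starts there. Exporting the variable in your shell (export OPENAI_API_KEY=sk-...) also works when the host passes its environment through to the server.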

Tools Reference

All tools return a consistent JSON envelope:

{
  "status": "success",
  "data": { "..." },
  "hints": ["What to do next..."]
}

Errors use "status": "error" with MCP's isError: true flag.
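One way to build this envelope is sketched below. The helper names (success, failure, toToolResult) and the shape of data on errors are assumptions for illustration; the status, data, and hints fields and the isError flag come from the description above.

```typescript
interface Envelope<T> {
  status: "success" | "error";
  data: T;
  hints: string[];
}

function success<T>(data: T, hints: string[]): Envelope<T> {
  return { status: "success", data, hints };
}

// Assumption: errors carry a message inside `data`.
function failure(message: string, hints: string[]): Envelope<{ message: string }> {
  return { status: "error", data: { message }, hints };
}

// Wrap an envelope as an MCP tool result: JSON text content plus the isError flag.
function toToolResult<T>(envelope: Envelope<T>) {
  return {
    content: [{ type: "text" as const, text: JSON.stringify(envelope) }],
    isError: envelope.status === "error",
  };
}
```

Routing every handler through the same two constructors keeps hints present on both success and error paths.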

`eval_assert` — Standalone Assertion Runner

Run assertions against any text output. No saved prompts, no datasets, no API keys for deterministic checks. This is the tool you'll use daily — the zero-setup entry point.

Input schema:

Field	Type	Required	Description
`output`	string	yes	The text to evaluate
`assertions`	Assertion[]	yes	List of assertion objects
`expected`	string	no	Reference text for factuality/similarity
`input`	string	no	Original query for relevance assertions
`grader_provider`	string	no	Provider for LLM-graded assertions, e.g. `"openai:gpt-5-mini"`

Example: Validate a Classifier Output (Deterministic Only)

No API key needed. Instant results.

{
  "output": "{\"category\": \"billing\", \"confidence\": 0.87, \"priority\": \"high\"}",
  "assertions": [
    { "type": "is-json" },
    { "type": "contains", "value": "category" },
    { "type": "regex", "value": "\"confidence\":\\s*0\\.\\d+" },
    { "type": "length-max", "value": "200" }
  ]
}

Response:

{
  "status": "success",
  "data": {
    "pass": true,
    "score": 1.0,
    "results": [
      { "type": "is-json", "pass": true, "score": 1, "reason": "Valid JSON", "weight": 1 },
      {
        "type": "contains",
        "value": "category",
        "pass": true,
        "score": 1,
        "reason": "Output contains \"category\"",
        "weight": 1
      },
      {
        "type": "regex",
        "value": "\"confidence\":\\s*0\\.\\d+",
        "pass": true,
        "score": 1,
        "reason": "Output matches regex",
        "weight": 1
      },
      {
        "type": "length-max",
        "value": "200",
        "pass": true,
        "score": 1,
        "reason": "Output length 63 is within max 200",
        "weight": 1
      }
    ]
  },
  "hints": ["All assertions passed. Score: 1.00."]
}

Example: LLM-Graded Quality Check

Uses a grader model to evaluate subjective criteria.

{
  "output": "Hey! So like, your account is kinda messed up. Gonna fix it tho, no worries lol",
  "assertions": [
    { "type": "llm-rubric", "value": "Response is professional and concise", "weight": 2 },
    { "type": "not-contains", "value": "lol" },
    { "type": "length-max", "value": "500" }
  ],
  "grader_provider": "openai:gpt-5-mini"
}

The grader uses generateObject with a Zod schema — guaranteed structured { pass, score, reason } response, no fragile regex parsing.

Example: Factuality Check Against Reference

{
  "output": "The Eiffel Tower is 330 meters tall and was completed in 1889.",
  "expected": "The Eiffel Tower is 330 meters tall, completed in 1889 for the World's Fair.",
  "assertions": [{ "type": "factuality" }, { "type": "contains", "value": "1889" }],
  "grader_provider": "openai:gpt-5-mini"
}

`eval_suite` — Prompt & Dataset Manager

CRUD for prompts and test datasets, plus LLM-powered synthetic test generation. All actions are dispatched via the action field.

Input schema:

Field	Type	Required for	Description
`action`	enum	always	One of the 9 actions below
`name`	string	most actions	Prompt or dataset name
`content`	string	`save_prompt`	Prompt template content
`variables`	string[]	no	Variable names (auto-extracted from `{{var}}` if omitted)
`tags`	string[]	no	Tags for categorization and filtering
`checksum`	string	updating prompt	Current checksum (required to update existing prompt)
`cases`	Case[]	`save_dataset`	Array of `{ vars, expected?, description? }`
`prompt_description`	string	`generate_dataset`	What the prompt does
`count`	number	no	Cases to generate (default: 5)
`provider`	string	`generate_dataset`	Provider for generation (required explicitly, no default)
`limit` / `offset`	number	no	Pagination for list actions

Actions:

Action	Required fields	Description
`save_prompt`	`name`, `content`	Upsert prompt. Auto-extracts `{{var}}` variables. Update requires matching `checksum`.
`get_prompt`	`name`	Retrieve full prompt content and metadata
`list_prompts`	--	Paginated summary list
`delete_prompt`	`name`	Remove prompt
`save_dataset`	`name`, `cases`	Upsert dataset
`get_dataset`	`name`	Retrieve dataset with all cases
`list_datasets`	--	Paginated summary list
`delete_dataset`	`name`	Remove dataset
`generate_dataset`	`prompt_description`, `provider`	Generate synthetic test cases via LLM

Example: Save a Prompt (Variables Auto-Extracted)

{
  "action": "save_prompt",
  "name": "ticket-classifier",
  "content": "You are a support ticket classifier.\n\nGiven a customer message, output JSON:\n- category: billing, technical, general\n- priority: low, medium, high\n- confidence: 0.0-1.0\n\nCustomer message: {{message}}",
  "tags": ["classification", "support"]
}

The {{message}} variable is auto-extracted. Response includes checksum for future updates:

{
  "status": "success",
  "data": {
    "action": "created",
    "name": "ticket-classifier",
    "variables": ["message"],
    "version": 1,
    "checksum": "a1b2c3..."
  },
  "hints": ["Use eval_run with prompt_name 'ticket-classifier' to run evaluations."]
}
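Variable auto-extraction could look roughly like this. The regular expression (identifier-style names, optional inner whitespace) and the render helper are assumptions; the server's actual pattern may accept a different set of names.

```typescript
// Assumed placeholder syntax: {{name}} with optional spaces inside the braces.
const PLACEHOLDER = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;

// Collect unique variable names in order of first appearance.
function extractVariables(template: string): string[] {
  const seen = new Set<string>();
  for (const match of template.matchAll(PLACEHOLDER)) {
    seen.add(match[1]);
  }
  return Array.from(seen);
}

// Substitute known variables; leave unknown placeholders untouched.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(PLACEHOLDER, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : whole
  );
}
```

Leaving unknown placeholders in place makes a missing variable visible in the rendered prompt instead of silently producing an empty string.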

Example: Update a Prompt (Checksum Required)

{
  "action": "save_prompt",
  "name": "ticket-classifier",
  "content": "You are a support ticket classifier. Be strict about confidence — only output >0.8 when you're sure.\n\n{{message}}",
  "checksum": "a1b2c3..."
}

If you omit the checksum, the error tells you exactly what to provide:

"Prompt 'ticket-classifier' exists (v1). To update, include checksum: 'a1b2c3...'"

Example: Generate a Synthetic Test Dataset

Have an LLM create diverse test cases from a description. Includes happy path, edge cases, and adversarial inputs automatically.

{
  "action": "generate_dataset",
  "prompt_description": "Classifies customer support tickets into category (billing/technical/general), priority (low/medium/high), and confidence (0-1)",
  "count": 5,
  "provider": "openai:gpt-5-mini",
  "name": "ticket-tests-v1",
  "tags": ["synthetic", "classification"]
}

Generated cases are saved to the database. Review with get_dataset before running an eval.

Example: Save a Manual Dataset

{
  "action": "save_dataset",
  "name": "ticket-tests-curated",
  "cases": [
    {
      "vars": { "message": "I was charged twice for my subscription this month" },
      "expected": "{\"category\": \"billing\", \"priority\": \"high\"}",
      "description": "Clear billing issue"
    },
    {
      "vars": { "message": "hey" },
      "description": "Adversarial: vague single-word input"
    },
    {
      "vars": { "message": "The app crashes when I open settings on Android 14" },
      "expected": "{\"category\": \"technical\", \"priority\": \"medium\"}",
      "description": "Technical issue with platform detail"
    }
  ],
  "tags": ["curated", "classification"]
}

`eval_run` — Execute Evaluations

The core evaluation engine. Runs a prompt against one or more providers across test cases, scores each output with assertions, and stores everything in SQLite.

Execution model: prompt x providers x cases x assertions = scored result matrix, run with configurable concurrency.
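The execution model can be sketched as a provider-by-case job matrix processed by a fixed number of concurrent lanes. Everything here (Job, buildMatrix, runWithConcurrency) is a hypothetical illustration under that assumption, not the project's runner.

```typescript
interface Job {
  provider: string;
  caseIndex: number;
}

// Cartesian product: every provider paired with every case index.
function buildMatrix(providers: string[], caseCount: number): Job[] {
  const jobs: Job[] = [];
  for (const provider of providers) {
    for (let caseIndex = 0; caseIndex < caseCount; caseIndex++) {
      jobs.push({ provider, caseIndex });
    }
  }
  return jobs;
}

// Run `worker` over all items with at most `limit` in flight.
// Results are stored by index, so they keep the input order.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so lanes never share an index
      results[index] = await worker(items[index]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

In a real run the worker would call the provider and then score the output with each assertion; with concurrency 3, at most three provider requests are outstanding at once.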

Input schema:

Field	Type	Required	Description
`prompt_name`	string	one of	Name of a saved prompt
`prompt`	string	one of	Inline prompt content
`dataset_name`	string	one of	Name of a saved dataset
`cases`	Case[]	one of	Inline test cases
`providers`	string[]	yes	Provider strings (default: `["openai:gpt-4o"]`)
`assertions`	Assertion[]	no	If omitted, outputs are collected with `score: 1.0`
`tags`	string[]	no	Tags for filtering runs later
`temperature`	number	no	Default: `0` (deterministic)
`max_tokens`	number	no	Default: `1024`
`concurrency`	number	no	Parallel requests (default: `3`)
`grader_provider`	string	no	Required for LLM-graded assertions

Providing both prompt_name and prompt is an error (ambiguous input). Same for dataset_name + cases.
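The "exactly one of" rule can be expressed as a small validation step. The RunInput shape below covers only the four fields involved, and the error wording is invented for illustration.

```typescript
interface RunInput {
  prompt_name?: string;
  prompt?: string;
  dataset_name?: string;
  cases?: unknown[];
}

// Require exactly one of two alternative fields; return an error message or null.
function exactlyOne(input: RunInput, a: keyof RunInput, b: keyof RunInput): string | null {
  const hasA = input[a] !== undefined;
  const hasB = input[b] !== undefined;
  if (hasA && hasB) return `Provide either ${a} or ${b}, not both.`;
  if (!hasA && !hasB) return `Provide one of ${a} or ${b}.`;
  return null;
}

// Collect all problems so the caller can fix them in one round trip.
function validateRunInput(input: RunInput): string[] {
  return [
    exactlyOne(input, "prompt_name", "prompt"),
    exactlyOne(input, "dataset_name", "cases"),
  ].filter((e): e is string => e !== null);
}
```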

Example: Compare Two Providers

{
  "prompt_name": "ticket-classifier",
  "dataset_name": "ticket-tests-v1",
  "providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
  "assertions": [
    { "type": "is-json" },
    { "type": "contains", "value": "category" },
    { "type": "contains", "value": "priority" },
    {
      "type": "llm-rubric",
      "value": "Classification is reasonable for the given customer message",
      "weight": 2
    }
  ],
  "grader_provider": "openai:gpt-5-mini",
  "tags": ["v1", "model-comparison"]
}

Response (token-aware summary):

{
  "status": "success",
  "data": {
    "runId": 1,
    "providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
    "totalCases": 5,
    "passRate": { "openai:gpt-5-mini": 0.8, "google:gemini-3.1-flash-lite-preview": 0.6 },
    "avgScore": { "openai:gpt-5-mini": 0.85, "google:gemini-3.1-flash-lite-preview": 0.72 },
    "avgLatencyMs": { "openai:gpt-5-mini": 650, "google:gemini-3.1-flash-lite-preview": 420 },
    "totalTokens": { "openai:gpt-5-mini": 3200, "google:gemini-3.1-flash-lite-preview": 2800 },
    "topFailures": [
      {
        "caseIndex": 1,
        "description": "Adversarial: vague single-word input",
        "provider": "google:gemini-3.1-flash-lite-preview",
        "output": "I'd be happy to help! Could you...",
        "failedAssertions": ["is-json: Invalid JSON"]
      }
    ]
  },
  "hints": [
    "Run complete. Use eval_analyze with action: 'get_run', run_id: 1 for full per-case breakdown.",
    "Top providers by pass rate: openai:gpt-5-mini: 80%, google:gemini-3.1-flash-lite-preview: 60%"
  ]
}

Notice: only the summary and top 3 failures — not all 10 individual results. Use eval_analyze for the full breakdown.

Example: Quick Inline Eval (No Saved Data)

No need to save anything — pass prompt and cases directly:

{
  "prompt": "Translate the following English text to French:\n\n{{text}}",
  "cases": [
    { "vars": { "text": "Hello, how are you?" }, "expected": "Bonjour, comment allez-vous ?" },
    {
      "vars": { "text": "The weather is nice today" },
      "expected": "Le temps est beau aujourd'hui"
    },
    { "vars": { "text": "" }, "description": "Edge case: empty input" }
  ],
  "providers": ["openai:gpt-5-mini"],
  "assertions": [
    { "type": "length-min", "value": "1" },
    { "type": "similarity", "threshold": 0.8 }
  ],
  "grader_provider": "openai:gpt-5-mini"
}

`eval_analyze` — Inspect & Compare Results

Browse, compare, and manage stored evaluation runs. This is where you analyze failures, detect regressions, and track quality trends.

Input schema:

Field	Type	Required for	Description
`action`	enum	always	One of the 5 actions below
`run_id`	number	`get_run`, `delete_run`	Run ID
`run_ids`	number[]	`compare_runs`	At least 2 run IDs
`prompt_name`	string	no	Filter by prompt name
`tag`	string	no	Filter by tag
`limit` / `offset`	number	no	Pagination (default: 20)
`provider`	string	no	Filter results by provider
`only_failed`	boolean	no	Return only failed cases

Actions:

Action	Description
`get_run`	Full run details with paginated per-case results. Supports `provider` and `only_failed` filters.
`compare_runs`	Side-by-side score comparison between 2+ runs. Computes deltas, detects regressions, counts improvements.
`list_runs`	Recent runs, filterable by `prompt_name` and `tag`.
`delete_run`	Remove run and all its results (CASCADE).
`trends`	Raw `{run_id, date, providers, avg_score, pass_rate}` list ordered by date.

Example: Deep Dive Into Failures

{
  "action": "get_run",
  "run_id": 1,
  "only_failed": true,
  "provider": "google:gemini-3.1-flash-lite-preview"
}

Returns only the cases that failed for Gemini, with full assertion results and the actual model output.

Example: Compare Runs After a Prompt Change

You fixed the prompt and re-ran. Now compare:

{
  "action": "compare_runs",
  "run_ids": [1, 2]
}

Response:

{
  "status": "success",
  "data": {
    "runs": ["run_1", "run_2"],
    "scoreChanges": {
      "openai:gpt-5-mini": "+0.10 (0.85 -> 0.95)",
      "google:gemini-3.1-flash-lite-preview": "+0.23 (0.72 -> 0.95)"
    },
    "regressions": [],
    "improvements": 2
  },
  "hints": ["2 provider(s) improved. No regressions detected."]
}
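A sketch of how the deltas in this response might be computed from per-provider average scores. The Scores and Comparison types, the compareRuns function, and the 0.005 tolerance for "no change" are assumptions made for illustration.

```typescript
type Scores = Record<string, number>;

interface Comparison {
  scoreChanges: Record<string, string>;
  regressions: string[];
  improvements: number;
}

// Assumption: differences smaller than this count as unchanged.
const TOLERANCE = 0.005;

function compareRuns(before: Scores, after: Scores): Comparison {
  const scoreChanges: Record<string, string> = {};
  const regressions: string[] = [];
  let improvements = 0;
  for (const provider of Object.keys(after)) {
    if (!Object.prototype.hasOwnProperty.call(before, provider)) continue; // only providers in both runs
    const delta = after[provider] - before[provider];
    const sign = delta >= 0 ? "+" : "-";
    scoreChanges[provider] =
      `${sign}${Math.abs(delta).toFixed(2)} (${before[provider].toFixed(2)} -> ${after[provider].toFixed(2)})`;
    if (delta > TOLERANCE) improvements++;
    else if (delta < -TOLERANCE) regressions.push(provider);
  }
  return { scoreChanges, regressions, improvements };
}
```

Feeding in the scores from the two example runs reproduces the strings shown in the response above.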

Example: Quality Trends Over Time

{
  "action": "trends",
  "prompt_name": "ticket-classifier"
}

Returns chronological data for charting score progression across runs.

Resources

Three MCP resources are exposed for supporting clients to browse server state.

URI	Type	Contents
`prompt-lab://prompts`	Static	JSON array of prompt summaries: name, version, variable count, tags
`prompt-lab://datasets`	Static	JSON array of dataset summaries: name, case count, tags
`prompt-lab://runs/{runId}`	Dynamic template	Full run metadata and summary for a specific run ID

The runs resource supports URI completion — clients can tab-complete run IDs from the 50 most recent runs.

Resources return summaries only. Full content lives behind the tools (eval_suite get_prompt, eval_suite get_dataset, eval_analyze get_run). This separation keeps resource reads lightweight while full data is available on demand.

Prompt Templates

Two MCP prompt templates are registered for quick invocation from supporting clients.

`quick-eval`

Run a quick evaluation of a prompt against a single test input.

Arguments: prompt, test_input, provider

Generates a message that instructs the host to call eval_run and interpret results — a one-shot workflow.

`generate-tests`

Generate a synthetic test dataset for a prompt.

Arguments: prompt_description, variable_names, count

Generates a message that calls eval_suite generate_dataset with guidance to include happy-path, edge, and adversarial cases.

Assertions Reference

Deterministic (free, instant, no API key required)

Type	Checks	`value` field
`contains`	Substring is present	Substring to find
`not-contains`	Substring is absent	Substring to reject
`equals`	Exact match (trimmed)	Expected string
`regex`	Regex pattern matches	Regex pattern
`starts-with`	Output begins with prefix	Prefix string
`is-json`	Output is valid JSON	Not used
`length-max`	Output length <= N chars	Max length as string, e.g. `"500"`
`length-min`	Output length >= N chars	Min length as string, e.g. `"10"`
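The deterministic checks in the table are simple enough to sketch in a few lines. The types and the runAssertion function are hypothetical, and the reason strings are simplified compared with the example responses earlier in this README.

```typescript
type DeterministicType =
  | "contains" | "not-contains" | "equals" | "regex"
  | "starts-with" | "is-json" | "length-max" | "length-min";

interface Assertion {
  type: DeterministicType;
  value?: string;
}

interface AssertionResult {
  type: DeterministicType;
  pass: boolean;
  score: number;
  reason: string;
}

function isJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Evaluate one check; `value` is always a string, as in the table above.
function check(output: string, type: DeterministicType, value: string): boolean {
  switch (type) {
    case "contains": return output.includes(value);
    case "not-contains": return !output.includes(value);
    case "equals": return output.trim() === value.trim();
    case "regex": return new RegExp(value).test(output);
    case "starts-with": return output.startsWith(value);
    case "is-json": return isJson(output);
    case "length-max": return output.length <= Number(value);
    case "length-min": return output.length >= Number(value);
  }
}

function runAssertion(output: string, assertion: Assertion): AssertionResult {
  const pass = check(output, assertion.type, assertion.value ?? "");
  return {
    type: assertion.type,
    pass,
    score: pass ? 1 : 0,
    reason: pass ? `${assertion.type} passed` : `${assertion.type} failed`,
  };
}
```

Each check is a pure function of the output text, which is why this group needs no API key.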

LLM-Graded (requires `grader_provider`)

Type	Checks	Uses
`llm-rubric`	Free-form criteria	`value` = rubric text
`factuality`	Output matches expected facts	`expected` field as reference
`relevance`	Output answers the query	`input` field as the query
`similarity`	Semantic similarity to expected	`expected` field + `threshold`

All assertions support a weight field (default: 1). The aggregate score is sum(weight * score) / sum(weight).
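The aggregate formula is small enough to show directly. In this sketch the Scored type and the aggregateScore name are illustrative; returning 1 for an empty list is an assumption chosen to match the eval_run note that outputs without assertions are collected with score 1.0.

```typescript
interface Scored {
  score: number;   // 0..1 for a single assertion
  weight?: number; // defaults to 1
}

// sum(weight * score) / sum(weight)
function aggregateScore(results: Scored[]): number {
  let weighted = 0;
  let total = 0;
  for (const r of results) {
    const w = r.weight ?? 1;
    weighted += w * r.score;
    total += w;
  }
  return total === 0 ? 1 : weighted / total;
}
```

For example, if a rubric with weight 2 fails, a not-contains check fails, and a length check passes, the aggregate is (2*0 + 1*0 + 1*1) / (2 + 1 + 1) = 0.25.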

LLM-graded assertions use generateObject with a Zod schema for guaranteed structured { pass, score, reason } responses — no fragile regex or JSON.parse on raw LLM text. Each assertion type has a distinct grading prompt optimized for that evaluation style.

Provider Configuration

Provider strings follow the format "provider:model":

Provider string	Required env var	Example
`openai:*`	`OPENAI_API_KEY`	`openai:gpt-5-mini`
`anthropic:*`	`ANTHROPIC_API_KEY`	`anthropic:claude-sonnet-4-20250514`
`google:*`	`GOOGLE_GENERATIVE_AI_API_KEY`	`google:gemini-3.1-flash-lite-preview`
`xai:*`	`XAI_API_KEY`	`xai:grok-3`
`openrouter:*`	`OPENROUTER_API_KEY`	`openrouter:meta-llama/llama-4-scout`
`local:*`	`LOCAL_LLM_URL` (optional)	`local:my-model`

The local provider uses an OpenAI-compatible endpoint. LOCAL_LLM_URL defaults to http://localhost:1234/v1 (LM Studio default).

Missing API keys produce clear, actionable errors:

Provider 'anthropic:claude-sonnet-4-20250514' requires ANTHROPIC_API_KEY.
Set it in your MCP server config env block or shell environment.
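Parsing a provider string and checking its key might look like the sketch below. The parseProvider function and the explicit env parameter are assumptions for illustration; the provider names and variable names come from the table above. The split happens at the first colon only, because model names such as meta-llama/llama-4-scout can contain other punctuation.

```typescript
// Provider prefix -> required env var (null means no key is needed).
const REQUIRED_ENV = new Map<string, string | null>([
  ["openai", "OPENAI_API_KEY"],
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["google", "GOOGLE_GENERATIVE_AI_API_KEY"],
  ["xai", "XAI_API_KEY"],
  ["openrouter", "OPENROUTER_API_KEY"],
  ["local", null],
]);

interface ResolvedProvider {
  provider: string;
  model: string;
}

// `env` is passed in explicitly so the function is easy to test.
function parseProvider(spec: string, env: Record<string, string | undefined>): ResolvedProvider {
  const colon = spec.indexOf(":");
  if (colon <= 0 || colon === spec.length - 1) {
    throw new Error(`Invalid provider string '${spec}'. Expected "provider:model".`);
  }
  const provider = spec.slice(0, colon);
  const model = spec.slice(colon + 1);
  const envVar = REQUIRED_ENV.get(provider);
  if (envVar === undefined) {
    const supported = Array.from(REQUIRED_ENV.keys()).join(", ");
    throw new Error(`Unknown provider '${provider}'. Supported: ${supported}`);
  }
  if (envVar !== null && !env[envVar]) {
    throw new Error(`Provider '${spec}' requires ${envVar}.`);
  }
  return { provider, model };
}
```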

No provider has a default for LLM-graded assertions or dataset generation. You must specify grader_provider and provider explicitly — no silent API spending.

Provider resolution is powered by Vercel AI SDK — a thin abstraction layer over each provider's SDK. It handles retries, types, and multi-provider support in ~5 lines per provider. No orchestration frameworks, no hidden magic. If needed, each provider can be swapped to a raw SDK call without changing the tool interface.

Walkthrough: End-to-End Prompt Engineering

This walkthrough demonstrates the full iterative loop — from writing a prompt to shipping a validated version. Everything happens inside your MCP host (Claude Code, Cursor, Claude Desktop).

Step 1: Save Your Prompt

"Save this as 'meeting-action-items':

Extract action items from this meeting transcript. Return a JSON array
where each item has: owner, task, due_date (or null if not mentioned).

{{transcript}}"

Claude calls eval_suite with save_prompt. Variables auto-extracted: ["transcript"].

Step 2: Generate Test Cases

"Generate 5 test cases for meeting-action-items using openai:gpt-5-mini"

Claude calls eval_suite with generate_dataset. The LLM creates diverse cases:

Simple 1-on-1 meeting with clear action items
Multi-person standup with overlapping responsibilities
Vague meeting with no concrete actions
Conflicting assignments (same task, two owners)
Long rambling transcript with buried action items

Step 3: Run the First Eval

"Run meeting-action-items against the generated dataset using openai:gpt-5-mini and google:gemini-3.1-flash-lite-preview.
Assert: valid JSON, contains 'owner', and use llm-rubric 'All action items are accurately extracted with correct owners'.
Use openai:gpt-5-mini as the grader."

Result: GPT-5-mini: 80% pass rate. Gemini: 60%. Top failures:

Case 3 (vague meeting): Gemini hallucinated action items that weren't in the transcript
Case 4 (conflicting assignments): Both models assigned the task to only one person

Step 4: Fix the Prompt and Re-Run

"Update meeting-action-items to handle conflicts by noting both owners, and add an instruction
to output an empty array when no clear actions exist. Then re-run the same eval."

Claude calls eval_suite save_prompt with the checksum from v1, then eval_run with identical parameters.

Step 5: Compare Runs

"Compare the two runs"

Claude calls eval_analyze compare_runs:

GPT-5-mini: +0.15 (0.80 -> 0.95)
Gemini:     +0.35 (0.60 -> 0.95)
Regressions: none

Both providers now average 0.95. The prompt fixes helped Gemini the most. No regressions. Ship it.

Step 6: Regression Check Later

A week later, you tweak the prompt for a new edge case.

"List recent runs for meeting-action-items, then run the same tests and compare against the last run"

Claude calls eval_analyze list_runs, then eval_run, then eval_analyze compare_runs. Instant confidence check before deploying the change.

Use Cases

Composability Matrix

What you want	Tools used	Setup needed
"Check this output"	`eval_assert` only	None
"Test my prompt"	`eval_suite` + `eval_run`	30 seconds
"Compare v1 vs v2"	`eval_run` x2 + `eval_analyze`	Already have v1
"Did I break anything?"	`eval_analyze` + `eval_run` + `eval_analyze`	Already have history
"Which model is best?"	`eval_run` (multiple providers)	Just a prompt + cases
"Generate test data"	`eval_suite generate_dataset`	Just a description
"Track quality over time"	`eval_analyze trends`	Already have runs

Specific Scenarios

Prompt iteration for classification tasks. Save a classifier prompt, generate synthetic edge cases, run against 2-3 providers, read failures, fix the prompt, re-run, compare. The full loop in one conversation.

Model selection for production. Same prompt, same test suite, 3 providers. One eval_run call gives pass rates, scores, latency, and token cost per provider. Data-driven model choice instead of vibes.

Pre-deploy regression testing. Changed a prompt? Run the existing test suite, compare against the last known-good run. Zero regressions = safe to deploy.

Output validation in agent pipelines. Use eval_assert as a quality gate — check that an agent's output is valid JSON, contains required fields, passes a rubric. No eval infrastructure needed.

Synthetic test generation. Describe what your prompt does, get diverse test cases including adversarial inputs. Review and curate before running evals.

Database

Created lazily on first tool call (STDIO servers must start instantly — no slow init).

Default location: ~/.mcp-prompt-lab/prompt-lab.db

Override with PROMPT_LAB_DB:

"env": { "PROMPT_LAB_DB": "/custom/path/prompt-lab.db" }

Reset: rm -rf ~/.mcp-prompt-lab

Backup: cp ~/.mcp-prompt-lab/prompt-lab.db ~/backups/prompt-lab-$(date +%Y%m%d).db

Schema: 4 tables — prompts, datasets, runs, results. Deleting a run cascades to its results.

Pragmas:

journal_mode=WAL: safe concurrent reads from multiple processes
foreign_keys=ON: enforced referential integrity
user_version: schema migration tracking (a value of 0 marks a fresh database, so the full schema is applied and the version is set to 1)

Engineering Decisions

Choices made deliberately, not by default.

Decision	Resolution	Why
4 tools, not 15	Group by user intent, not CRUD operation	Agents select tools better with fewer, well-scoped options. 4 is well under the 10-15 tool limit.
STDIO-only, no HTTP	Local-first: keys in env, data in SQLite	Eval data is sensitive (prompts, test cases, model outputs). Remote HTTP would need auth, sessions, key management.
Vercel AI SDK	Thin provider abstraction, not an orchestration framework	Unified `generateText`/`generateObject` across 6 providers. Can migrate to raw SDKs later — the tool interface doesn't change.
Explicit grader_provider	No default grader, no silent API calls	LLM-graded assertions cost tokens. The user must opt in per-call.
Checksum on mutations	SHA-256 of prompt content	Prevents overwriting prompts changed in another session. Error includes current checksum for easy recovery.
Token-aware summaries	Summary + top 3 failures, details via pagination	150 results in a single response would blow context. Summary-first keeps results usable in any host.
Weighted scoring	`weight` on every assertion	A `length-max` check and an `llm-rubric` on accuracy aren't equally important. Weights let the score reflect actual quality criteria.
Global DB, not per-project	`~/.mcp-prompt-lab/` default, overridable via env	Eval data is cross-project by nature. You're comparing prompts, not building per-repo config.
Lazy DB init	Created on first tool call, not server start	STDIO servers must respond to `initialize` instantly. SQLite setup happens when actually needed.
Zod v4 for grader schemas	`generateObject` with typed schemas	Guaranteed structured grader responses. No regex parsing of raw LLM text.
Static hints, not generated	Hardcoded per action and outcome	Predictable, fast, zero tokens. Every response guides the next logical action.
Bun-native SQLite	`bun:sqlite`, zero npm dependencies for DB	One less dependency. Bun's SQLite is fast, synchronous, and supports WAL natively.

License

MIT
