mcp-prompt-lab


A local MCP server that brings prompt evaluation directly into your AI coding environment. Define test cases, run prompts against multiple LLM providers, score outputs with deterministic and LLM-graded assertions, and track quality over time — through 4 consolidated tools, 3 resources, and 2 prompt templates.

The only MCP server that exposes general-purpose prompt evaluation as MCP tools. Every other eval tool in the ecosystem tests MCP servers from the outside. This one brings eval capabilities into the host — so you iterate on prompts without leaving your editor.


Design Philosophy

Most MCP servers on GitHub are thin API wrappers: one endpoint becomes one tool, names are generic, there is no error recovery, and the README says "install and run." This project takes the opposite approach, applying production-grade patterns to a domain that matters — evaluation is the scarcest skill in AI engineering right now, and this server makes it accessible from any MCP-compatible host.

Tool Consolidation (4 tools, not 15+)

The official MCP filesystem server exposes 13 tools. Agents work best with 10-15 tools maximum — after that, tool selection accuracy degrades. This server consolidates all evaluation operations into 4 tools grouped by user intent, not by CRUD operation:

| Tool | Intent | Actions |
| --- | --- | --- |
| eval_assert | "Check this output right now" | Single-purpose, no actions |
| eval_suite | "Set up an evaluation" | 9 actions (CRUD for prompts + datasets + generation) |
| eval_run | "Run an evaluation" | Single-purpose, matrix execution |
| eval_analyze | "Look at results" | 5 actions (get, compare, list, delete, trends) |

Each tool uses action dispatch internally. The agent sees 4 clean entry points instead of 15+ individual tools competing for selection.
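As a rough illustration of action dispatch, one tool can accept a discriminated union keyed on the action field and route to a handler per action. This is a minimal sketch, not the server's actual code: the AnalyzeInput type and dispatchAnalyze function are hypothetical names, and the handler bodies are placeholders that only return a description string.

```typescript
// Hypothetical input type: one variant per eval_analyze action.
type AnalyzeInput =
  | { action: "get_run"; run_id: number }
  | { action: "compare_runs"; run_ids: number[] }
  | { action: "list_runs" }
  | { action: "delete_run"; run_id: number }
  | { action: "trends"; prompt_name?: string };

// A single entry point that routes on `action`. Real handlers would query
// the database; these placeholders only describe what would happen.
function dispatchAnalyze(input: AnalyzeInput): string {
  switch (input.action) {
    case "get_run":
      return `get run ${input.run_id}`;
    case "compare_runs":
      return `compare runs ${input.run_ids.join(", ")}`;
    case "list_runs":
      return "list runs";
    case "delete_run":
      return `delete run ${input.run_id}`;
    case "trends":
      return `trends for ${input.prompt_name ?? "all prompts"}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // action is added to the union without a matching case.
      const unreachable: never = input;
      throw new Error(`Unknown action: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The union keeps per-action required fields type-checked while the agent still sees a single tool.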

Dynamic Hints

Every response includes hints — not just on errors, but on success too. Five rules:

  1. Errors say what happened AND what to do next. "Prompt 'x' exists (v2). To update, include checksum: 'abc123...'" — the fix is in the error message.
  2. Settings a feature depends on are called out. Using LLM-graded assertions without grader_provider returns a hint naming the exact parameter to add.
  3. Success responses suggest the logical follow-up. Save a prompt → "Use eval_run with prompt_name 'x' to run evaluations."
  4. Wrong values suggest available options. Invalid provider string → lists all supported providers.
  5. Auto-corrections get reported. Variables auto-extracted from {{var}} patterns → response confirms what was detected.

Token-Aware Responses

An eval with 50 cases x 3 providers = 150 results. Dumping all into context would blow the token budget. Response strategy:

  • eval_run returns a summary (pass rates, avg scores, latency, token usage) + top 3 failures with reasons
  • Full per-case breakdowns live behind eval_analyze get_run with limit/offset pagination
  • Long outputs are truncated to a token budget before embedding in failure reports

This makes eval results usable even in contexts with aggressive token limits.
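The response strategy above could be sketched roughly as follows. The names (CaseResult, summarize, truncateToTokenBudget, page) are hypothetical, and the 4-characters-per-token estimate is an assumption made for illustration; the server's own token estimation may differ.

```typescript
interface CaseResult {
  caseIndex: number;
  provider: string;
  pass: boolean;
  score: number;
  output: string;
}

// Assumed heuristic: roughly 4 characters per token.
function truncateToTokenBudget(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars) + "...";
}

// Summary first: overall pass rate plus the few worst failures, each with
// its output truncated before it is embedded in the report.
function summarize(results: CaseResult[], maxFailures = 3, outputBudget = 20) {
  const passed = results.filter((r) => r.pass).length;
  const topFailures = results
    .filter((r) => !r.pass)
    .sort((a, b) => a.score - b.score) // lowest scores first
    .slice(0, maxFailures)
    .map((r) => ({ ...r, output: truncateToTokenBudget(r.output, outputBudget) }));
  return {
    total: results.length,
    passRate: results.length === 0 ? 0 : passed / results.length,
    topFailures,
  };
}

// The full breakdown stays behind limit/offset pagination.
function page<T>(items: T[], limit: number, offset: number): T[] {
  return items.slice(offset, offset + limit);
}
```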

Checksum Pattern for Safe Mutations

Updating a saved prompt requires the current checksum (SHA-256 of the content). This prevents overwriting a prompt that was changed in another session or by another tool call — a real concern when multiple agents share the same MCP server. The error response always includes the current checksum, so recovery is one copy-paste away.
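A minimal sketch of the checksum guard, assuming an in-memory Map in place of the SQLite table and node:crypto in place of Bun.CryptoHasher (both are substitutions made so the example is self-contained). StoredPrompt, SaveResult, and savePrompt are hypothetical names.

```typescript
import { createHash } from "node:crypto";

interface StoredPrompt {
  name: string;
  content: string;
  version: number;
  checksum: string;
}

type SaveResult =
  | { status: "success"; prompt: StoredPrompt }
  | { status: "error"; message: string };

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Upsert with optimistic concurrency: an update is accepted only if the
// caller supplies the checksum of the version currently stored.
function savePrompt(
  store: Map<string, StoredPrompt>,
  name: string,
  content: string,
  checksum?: string,
): SaveResult {
  const existing = store.get(name);
  if (existing && checksum !== existing.checksum) {
    // The error carries the current checksum so the caller can retry.
    return {
      status: "error",
      message: `Prompt '${name}' exists (v${existing.version}). To update, include checksum: '${existing.checksum}'`,
    };
  }
  const prompt: StoredPrompt = {
    name,
    content,
    version: existing ? existing.version + 1 : 1,
    checksum: sha256(content),
  };
  store.set(name, prompt);
  return { status: "success", prompt };
}
```

If another session updated the prompt in between, the stored checksum no longer matches the one the caller holds, so the stale write is rejected instead of silently overwriting.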


Architecture Overview

src/
  index.ts              STDIO transport entry point
  server.ts             MCP server setup: tools, resources, prompts
  tools/
    eval-assert.ts      Standalone assertion runner
    eval-suite.ts       Prompt & dataset CRUD + synthetic generation
    eval-run.ts         Evaluation execution orchestrator
    eval-analyze.ts     Run analysis, comparison, trends
  engine/
    assertions.ts       8 deterministic + 4 LLM-graded assertion types
    providers.ts        Unified provider resolution (5 hosted providers + local)
    runner.ts           Concurrency-controlled matrix execution
  db/
    database.ts         SQLite (bun:sqlite) with WAL, migrations, lazy init
  utils/
    checksum.ts         SHA-256 via Bun.CryptoHasher
    hints.ts            Consistent response envelope formatting
    tokens.ts           Token estimation + truncation for context budget

Key architectural choices:

  • Bun-native SQLite (bun:sqlite) — zero dependencies for persistence, WAL mode for safe concurrent reads
  • Vercel AI SDK — thin provider abstraction with unified generateText/generateObject across all providers. No orchestration frameworks (no LangChain, no CrewAI)
  • Zod v4 — shared schema validation between MCP tool inputs and LLM grader structured outputs
  • STDIO-only transport — API keys stay in your local environment, eval history stays in local SQLite. Zero infrastructure to deploy.

Quick Start

Prerequisites: Bun installed.

git clone https://github.com/nicholasbarwicki/mcp-prompt-lab
cd mcp-prompt-lab
bun install

Verify the server starts:

bun run src/index.ts

To inspect with MCP Inspector: use STDIO transport, command bun, args ["run", "/absolute/path/to/src/index.ts"].


Configuration

Claude Code

Create .mcp.json in the project root (or add to ~/.claude.json globally):

{
  "mcpServers": {
    "prompt-lab": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "GOOGLE_GENERATIVE_AI_API_KEY": "${GOOGLE_GENERATIVE_AI_API_KEY}"
      }
    }
  }
}

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "prompt-lab": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_GENERATIVE_AI_API_KEY": "AIza..."
      }
    }
  }
}

Cursor / Windsurf / Any MCP Host

Same pattern — STDIO transport, bun run command, env vars for the providers you use.

Only include API keys for providers you intend to use. Deterministic assertions require no keys at all.

Bun auto-loads .env from the project root, so putting OPENAI_API_KEY=sk-... in a .env file works too, as does exporting the variable in your shell.


Tools Reference

All tools return a consistent JSON envelope:

{
  "status": "success",
  "data": { "..." },
  "hints": ["What to do next..."]
}

Errors use "status": "error" with MCP's isError: true flag.
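One way to picture the envelope is as a small helper that serializes it into an MCP tool result, setting isError only for error envelopes. This is a sketch under assumptions: Envelope, ToolResult, and toToolResult are hypothetical names, the ToolResult shape is a simplified local copy of an MCP text result, and whether error envelopes carry a data field is not stated in this README.

```typescript
interface Envelope<T> {
  status: "success" | "error";
  data: T;
  hints: string[];
}

// Simplified local stand-in for an MCP tool result with text content.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function toToolResult<T>(envelope: Envelope<T>): ToolResult {
  const result: ToolResult = {
    content: [{ type: "text", text: JSON.stringify(envelope, null, 2) }],
  };
  // Only error envelopes set the flag; successes leave it undefined.
  if (envelope.status === "error") {
    result.isError = true;
  }
  return result;
}
```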


eval_assert — Standalone Assertion Runner

Run assertions against any text output. No saved prompts, no datasets, no API keys for deterministic checks. This is the tool you'll use daily — the zero-setup entry point.

Input schema:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| output | string | yes | The text to evaluate |
| assertions | Assertion[] | yes | List of assertion objects |
| expected | string | no | Reference text for factuality/similarity |
| input | string | no | Original query for relevance assertions |
| grader_provider | string | no | Provider for LLM-graded assertions, e.g. "openai:gpt-5-mini" |

Example: Validate a Classifier Output (Deterministic Only)

No API key needed. Instant results.

{
  "output": "{\"category\": \"billing\", \"confidence\": 0.87, \"priority\": \"high\"}",
  "assertions": [
    { "type": "is-json" },
    { "type": "contains", "value": "category" },
    { "type": "regex", "value": "\"confidence\":\\s*0\\.\\d+" },
    { "type": "length-max", "value": "200" }
  ]
}

Response:

{
  "status": "success",
  "data": {
    "pass": true,
    "score": 1.0,
    "results": [
      { "type": "is-json", "pass": true, "score": 1, "reason": "Valid JSON", "weight": 1 },
      {
        "type": "contains",
        "value": "category",
        "pass": true,
        "score": 1,
        "reason": "Output contains \"category\"",
        "weight": 1
      },
      {
        "type": "regex",
        "value": "\"confidence\":\\s*0\\.\\d+",
        "pass": true,
        "score": 1,
        "reason": "Output matches regex",
        "weight": 1
      },
      {
        "type": "length-max",
        "value": "200",
        "pass": true,
        "score": 1,
        "reason": "Output length 63 is within max 200",
        "weight": 1
      }
    ]
  },
  "hints": ["All assertions passed. Score: 1.00."]
}

Example: LLM-Graded Quality Check

Uses a grader model to evaluate subjective criteria.

{
  "output": "Hey! So like, your account is kinda messed up. Gonna fix it tho, no worries lol",
  "assertions": [
    { "type": "llm-rubric", "value": "Response is professional and concise", "weight": 2 },
    { "type": "not-contains", "value": "lol" },
    { "type": "length-max", "value": "500" }
  ],
  "grader_provider": "openai:gpt-5-mini"
}

The grader uses generateObject with a Zod schema — guaranteed structured { pass, score, reason } response, no fragile regex parsing.

Example: Factuality Check Against Reference

{
  "output": "The Eiffel Tower is 330 meters tall and was completed in 1889.",
  "expected": "The Eiffel Tower is 330 meters tall, completed in 1889 for the World's Fair.",
  "assertions": [{ "type": "factuality" }, { "type": "contains", "value": "1889" }],
  "grader_provider": "openai:gpt-5-mini"
}

eval_suite — Prompt & Dataset Manager

CRUD for prompts and test datasets, plus LLM-powered synthetic test generation. All actions dispatched via the action field.

Input schema:

| Field | Type | Required for | Description |
| --- | --- | --- | --- |
| action | enum | always | One of the 9 actions below |
| name | string | most actions | Prompt or dataset name |
| content | string | save_prompt | Prompt template content |
| variables | string[] | no | Variable names (auto-extracted from {{var}} if omitted) |
| tags | string[] | no | Tags for categorization and filtering |
| checksum | string | updating prompt | Current checksum (required to update existing prompt) |
| cases | Case[] | save_dataset | Array of { vars, expected?, description? } |
| prompt_description | string | generate_dataset | What the prompt does |
| count | number | no | Cases to generate (default: 5) |
| provider | string | generate_dataset | Provider for generation (required explicitly, no default) |
| limit / offset | number | no | Pagination for list actions |

Actions:

| Action | Required fields | Description |
| --- | --- | --- |
| save_prompt | name, content | Upsert prompt. Auto-extracts {{var}} variables. Update requires matching checksum. |
| get_prompt | name | Retrieve full prompt content and metadata |
| list_prompts | -- | Paginated summary list |
| delete_prompt | name | Remove prompt |
| save_dataset | name, cases | Upsert dataset |
| get_dataset | name | Retrieve dataset with all cases |
| list_datasets | -- | Paginated summary list |
| delete_dataset | name | Remove dataset |
| generate_dataset | prompt_description, provider | Generate synthetic test cases via LLM |

Example: Save a Prompt (Variables Auto-Extracted)

{
  "action": "save_prompt",
  "name": "ticket-classifier",
  "content": "You are a support ticket classifier.\n\nGiven a customer message, output JSON:\n- category: billing, technical, general\n- priority: low, medium, high\n- confidence: 0.0-1.0\n\nCustomer message: {{message}}",
  "tags": ["classification", "support"]
}

The {{message}} variable is auto-extracted. Response includes checksum for future updates:

{
  "status": "success",
  "data": {
    "action": "created",
    "name": "ticket-classifier",
    "variables": ["message"],
    "version": 1,
    "checksum": "a1b2c3..."
  },
  "hints": ["Use eval_run with prompt_name 'ticket-classifier' to run evaluations."]
}
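The variable auto-extraction in this example could work along these lines. This is an illustrative sketch only: extractVariables and render are hypothetical helpers, and the accepted identifier syntax (letters, digits, underscores, optional inner whitespace) is an assumption rather than the server's documented rule.

```typescript
// Collect the distinct {{name}} placeholders in order of first appearance.
function extractVariables(template: string): string[] {
  const pattern = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    if (names.indexOf(match[1]) < 0) {
      names.push(match[1]);
    }
  }
  return names;
}

// Substitute values for placeholders; unknown names are left untouched.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (whole: string, name: string) => (name in vars ? vars[name] : whole),
  );
}
```

For the prompt above, extractVariables would return a single entry, "message", matching the variables field in the response.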

Example: Update a Prompt (Checksum Required)

{
  "action": "save_prompt",
  "name": "ticket-classifier",
  "content": "You are a support ticket classifier. Be strict about confidence — only output >0.8 when you're sure.\n\n{{message}}",
  "checksum": "a1b2c3..."
}

If you omit the checksum, the error tells you exactly what to provide:

"Prompt 'ticket-classifier' exists (v1). To update, include checksum: 'a1b2c3...'"

Example: Generate a Synthetic Test Dataset

Have an LLM create diverse test cases from a description. Includes happy path, edge cases, and adversarial inputs automatically.

{
  "action": "generate_dataset",
  "prompt_description": "Classifies customer support tickets into category (billing/technical/general), priority (low/medium/high), and confidence (0-1)",
  "count": 5,
  "provider": "openai:gpt-5-mini",
  "name": "ticket-tests-v1",
  "tags": ["synthetic", "classification"]
}

Generated cases are saved to the database. Review with get_dataset before running an eval.

Example: Save a Manual Dataset

{
  "action": "save_dataset",
  "name": "ticket-tests-curated",
  "cases": [
    {
      "vars": { "message": "I was charged twice for my subscription this month" },
      "expected": "{\"category\": \"billing\", \"priority\": \"high\"}",
      "description": "Clear billing issue"
    },
    {
      "vars": { "message": "hey" },
      "description": "Adversarial: vague single-word input"
    },
    {
      "vars": { "message": "The app crashes when I open settings on Android 14" },
      "expected": "{\"category\": \"technical\", \"priority\": \"medium\"}",
      "description": "Technical issue with platform detail"
    }
  ],
  "tags": ["curated", "classification"]
}

eval_run — Execute Evaluations

The core evaluation engine. Runs a prompt against one or more providers across test cases, scores each output with assertions, and stores everything in SQLite.

Execution model: prompt x providers x cases x assertions = scored result matrix, run with configurable concurrency.
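A concurrency-limited matrix run can be sketched as below. The names (Job, buildMatrix, runWithConcurrency) are hypothetical and the worker is a stand-in for the real provider call; the sketch only illustrates expanding providers x cases into jobs and capping how many run at once.

```typescript
interface Job {
  provider: string;
  caseIndex: number;
}

// Expand providers x cases into a flat job list.
function buildMatrix(providers: string[], caseCount: number): Job[] {
  const jobs: Job[] = [];
  for (const provider of providers) {
    for (let caseIndex = 0; caseIndex < caseCount; caseIndex++) {
      jobs.push({ provider, caseIndex });
    }
  }
  return jobs;
}

// Run the worker over all items with at most `limit` in flight at once.
// Results come back in the same order as the input items.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // safe: JavaScript runs this step synchronously
      results[index] = await worker(items[index]);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}
```

Each lane pulls the next job as soon as its current one finishes, so slow responses from one provider do not stall the rest of the matrix.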

Input schema:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| prompt_name | string | one of | Name of a saved prompt |
| prompt | string | one of | Inline prompt content |
| dataset_name | string | one of | Name of a saved dataset |
| cases | Case[] | one of | Inline test cases |
| providers | string[] | yes | Provider strings (default: ["openai:gpt-4o"]) |
| assertions | Assertion[] | no | If omitted, outputs are collected with score: 1.0 |
| tags | string[] | no | Tags for filtering runs later |
| temperature | number | no | Default: 0 (deterministic) |
| max_tokens | number | no | Default: 1024 |
| concurrency | number | no | Parallel requests (default: 3) |
| grader_provider | string | no | Required for LLM-graded assertions |

Providing both prompt_name and prompt is an error (ambiguous input). Same for dataset_name + cases.
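The "one of" rule can be expressed as an exclusive-or check on each pair of fields. The sketch below is hypothetical (RunInput, exactlyOne, validateRunInput, and the error strings are all invented for illustration) and also rejects the case where neither field of a pair is given, which the table's "one of" wording implies.

```typescript
interface RunInput {
  prompt_name?: string;
  prompt?: string;
  dataset_name?: string;
  cases?: unknown[];
}

// True when exactly one of the two values is defined.
function exactlyOne(a: unknown, b: unknown): boolean {
  return (a !== undefined) !== (b !== undefined);
}

// Returns a list of problems; an empty list means the input is acceptable.
function validateRunInput(input: RunInput): string[] {
  const errors: string[] = [];
  if (!exactlyOne(input.prompt_name, input.prompt)) {
    errors.push("Provide exactly one of prompt_name or prompt.");
  }
  if (!exactlyOne(input.dataset_name, input.cases)) {
    errors.push("Provide exactly one of dataset_name or cases.");
  }
  return errors;
}
```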

Example: Compare Two Providers

{
  "prompt_name": "ticket-classifier",
  "dataset_name": "ticket-tests-v1",
  "providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
  "assertions": [
    { "type": "is-json" },
    { "type": "contains", "value": "category" },
    { "type": "contains", "value": "priority" },
    {
      "type": "llm-rubric",
      "value": "Classification is reasonable for the given customer message",
      "weight": 2
    }
  ],
  "grader_provider": "openai:gpt-5-mini",
  "tags": ["v1", "model-comparison"]
}

Response (token-aware summary):

{
  "status": "success",
  "data": {
    "runId": 1,
    "providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
    "totalCases": 5,
    "passRate": { "openai:gpt-5-mini": 0.8, "google:gemini-3.1-flash-lite-preview": 0.6 },
    "avgScore": { "openai:gpt-5-mini": 0.85, "google:gemini-3.1-flash-lite-preview": 0.72 },
    "avgLatencyMs": { "openai:gpt-5-mini": 650, "google:gemini-3.1-flash-lite-preview": 420 },
    "totalTokens": { "openai:gpt-5-mini": 3200, "google:gemini-3.1-flash-lite-preview": 2800 },
    "topFailures": [
      {
        "caseIndex": 1,
        "description": "Adversarial: vague single-word input",
        "provider": "google:gemini-3.1-flash-lite-preview",
        "output": "I'd be happy to help! Could you...",
        "failedAssertions": ["is-json: Invalid JSON"]
      }
    ]
  },
  "hints": [
    "Run complete. Use eval_analyze with action: 'get_run', run_id: 1 for full per-case breakdown.",
    "Top providers by pass rate: openai:gpt-5-mini: 80%, google:gemini-3.1-flash-lite-preview: 60%"
  ]
}

Notice: only the summary and top 3 failures — not all 10 individual results. Use eval_analyze for the full breakdown.

Example: Quick Inline Eval (No Saved Data)

No need to save anything — pass prompt and cases directly:

{
  "prompt": "Translate the following English text to French:\n\n{{text}}",
  "cases": [
    { "vars": { "text": "Hello, how are you?" }, "expected": "Bonjour, comment allez-vous ?" },
    {
      "vars": { "text": "The weather is nice today" },
      "expected": "Le temps est beau aujourd'hui"
    },
    { "vars": { "text": "" }, "description": "Edge case: empty input" }
  ],
  "providers": ["openai:gpt-5-mini"],
  "assertions": [
    { "type": "length-min", "value": "1" },
    { "type": "similarity", "threshold": 0.8 }
  ],
  "grader_provider": "openai:gpt-5-mini"
}

eval_analyze — Inspect & Compare Results

Browse, compare, and manage stored evaluation runs. This is where you analyze failures, detect regressions, and track quality trends.

Input schema:

| Field | Type | Required for | Description |
| --- | --- | --- | --- |
| action | enum | always | One of the 5 actions below |
| run_id | number | get_run, delete_run | Run ID |
| run_ids | number[] | compare_runs | At least 2 run IDs |
| prompt_name | string | no | Filter by prompt name |
| tag | string | no | Filter by tag |
| limit / offset | number | no | Pagination (default: 20) |
| provider | string | no | Filter results by provider |
| only_failed | boolean | no | Return only failed cases |

Actions:

| Action | Description |
| --- | --- |
| get_run | Full run details with paginated per-case results. Supports provider and only_failed filters. |
| compare_runs | Side-by-side score comparison between 2+ runs. Computes deltas, detects regressions, counts improvements. |
| list_runs | Recent runs, filterable by prompt_name and tag. |
| delete_run | Remove run and all its results (CASCADE). |
| trends | Raw {run_id, date, providers, avg_score, pass_rate} list ordered by date. |

Example: Deep Dive Into Failures

{
  "action": "get_run",
  "run_id": 1,
  "only_failed": true,
  "provider": "google:gemini-3.1-flash-lite-preview"
}

Returns only the cases that failed for Gemini, with full assertion results and the actual model output.

Example: Compare Runs After a Prompt Change

You fixed the prompt and re-ran. Now compare:

{
  "action": "compare_runs",
  "run_ids": [1, 2]
}

Response:

{
  "status": "success",
  "data": {
    "runs": ["run_1", "run_2"],
    "scoreChanges": {
      "openai:gpt-5-mini": "+0.10 (0.85 -> 0.95)",
      "google:gemini-3.1-flash-lite-preview": "+0.23 (0.72 -> 0.95)"
    },
    "regressions": [],
    "improvements": 2
  },
  "hints": ["2 provider(s) improved. No regressions detected."]
}
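The delta computation behind a comparison like this might look roughly as follows. It is a sketch for two runs only, with hypothetical names (Scores, compareRuns); it treats any negative delta as a regression, whereas the actual server may apply a tolerance.

```typescript
// Average score per provider for a single run.
type Scores = Record<string, number>;

function compareRuns(before: Scores, after: Scores) {
  const scoreChanges: Record<string, string> = {};
  const regressions: string[] = [];
  let improvements = 0;
  for (const provider of Object.keys(after)) {
    if (!(provider in before)) continue; // only providers present in both runs
    const delta = after[provider] - before[provider];
    const sign = delta >= 0 ? "+" : "-";
    scoreChanges[provider] =
      `${sign}${Math.abs(delta).toFixed(2)} ` +
      `(${before[provider].toFixed(2)} -> ${after[provider].toFixed(2)})`;
    if (delta > 0) improvements++;
    if (delta < 0) regressions.push(provider);
  }
  return { scoreChanges, regressions, improvements };
}
```

Called with the average scores from the two runs above, this produces the same scoreChanges strings and an improvements count of 2.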

Example: Quality Trends Over Time

{
  "action": "trends",
  "prompt_name": "ticket-classifier"
}

Returns chronological data for charting score progression across runs.


Resources

Three MCP resources are exposed for supporting clients to browse server state.

| URI | Type | Contents |
| --- | --- | --- |
| prompt-lab://prompts | Static | JSON array of prompt summaries: name, version, variable count, tags |
| prompt-lab://datasets | Static | JSON array of dataset summaries: name, case count, tags |
| prompt-lab://runs/{runId} | Dynamic template | Full run metadata and summary for a specific run ID |

The runs resource supports URI completion — clients can tab-complete run IDs from the 50 most recent runs.

Resources return summaries only. Full content lives behind the tools (eval_suite get_prompt, eval_suite get_dataset, eval_analyze get_run). This separation keeps resource reads lightweight while full data is available on demand.


Prompt Templates

Two MCP prompt templates are registered for quick invocation from supporting clients.

quick-eval

Run a quick evaluation of a prompt against a single test input.

Arguments: prompt, test_input, provider

Generates a message that instructs the host to call eval_run and interpret results — a one-shot workflow.

generate-tests

Generate a synthetic test dataset for a prompt.

Arguments: prompt_description, variable_names, count

Generates a message that calls eval_suite generate_dataset with guidance to include happy-path, edge, and adversarial cases.


Assertions Reference

Deterministic (free, instant, no API key required)

| Type | Checks | value field |
| --- | --- | --- |
| contains | Substring is present | Substring to find |
| not-contains | Substring is absent | Substring to reject |
| equals | Exact match (trimmed) | Expected string |
| regex | Regex pattern matches | Regex pattern |
| starts-with | Output begins with prefix | Prefix string |
| is-json | Output is valid JSON | Not used |
| length-max | Output length <= N chars | Max length as string, e.g. "500" |
| length-min | Output length >= N chars | Min length as string, e.g. "10" |
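To make the semantics of the deterministic types concrete, here is a minimal sketch of a checker. It is not the server's implementation: the Assertion type and check function are hypothetical, and details such as trimming both sides for equals or measuring length in UTF-16 code units are assumptions.

```typescript
type Assertion =
  | {
      type:
        | "contains"
        | "not-contains"
        | "equals"
        | "regex"
        | "starts-with"
        | "length-max"
        | "length-min";
      value: string;
    }
  | { type: "is-json" };

// Returns true when the output satisfies the assertion.
function check(output: string, assertion: Assertion): boolean {
  switch (assertion.type) {
    case "contains":
      return output.includes(assertion.value);
    case "not-contains":
      return !output.includes(assertion.value);
    case "equals":
      return output.trim() === assertion.value.trim(); // assumed: both sides trimmed
    case "regex":
      return new RegExp(assertion.value).test(output);
    case "starts-with":
      return output.startsWith(assertion.value);
    case "is-json":
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
    case "length-max":
      return output.length <= Number(assertion.value);
    case "length-min":
      return output.length >= Number(assertion.value);
    default:
      throw new Error("Unsupported assertion type");
  }
}
```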

LLM-Graded (requires grader_provider)

| Type | Checks | Uses |
| --- | --- | --- |
| llm-rubric | Free-form criteria | value = rubric text |
| factuality | Output matches expected facts | expected field as reference |
| relevance | Output answers the query | input field as the query |
| similarity | Semantic similarity to expected | expected field + threshold |

All assertions support a weight field (default: 1). The aggregate score is sum(weight * score) / sum(weight).
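The aggregate formula is a weighted mean. A small sketch, with hypothetical names (Scored, aggregateScore) and the assumption that an empty assertion list yields 0:

```typescript
interface Scored {
  score: number;
  weight?: number; // defaults to 1 when omitted
}

// sum(weight * score) / sum(weight)
function aggregateScore(results: Scored[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const result of results) {
    const weight = result.weight ?? 1;
    weightedSum += weight * result.score;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}
```

For example, a rubric scored 0.5 with weight 2 alongside two passing checks with weight 1 gives (2 * 0.5 + 1 + 1) / 4 = 0.75, so the heavier rubric pulls the aggregate down more than an unweighted average would.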

LLM-graded assertions use generateObject with a Zod schema for guaranteed structured { pass, score, reason } responses — no fragile regex or JSON.parse on raw LLM text. Each assertion type has a distinct grading prompt optimized for that evaluation style.


Provider Configuration

Provider strings follow the format "provider:model":

| Provider string | Required env var | Example |
| --- | --- | --- |
| openai:* | OPENAI_API_KEY | openai:gpt-5-mini |
| anthropic:* | ANTHROPIC_API_KEY | anthropic:claude-sonnet-4-20250514 |
| google:* | GOOGLE_GENERATIVE_AI_API_KEY | google:gemini-3.1-flash-lite-preview |
| xai:* | XAI_API_KEY | xai:grok-3 |
| openrouter:* | OPENROUTER_API_KEY | openrouter:meta-llama/llama-4-scout |
| local:* | LOCAL_LLM_URL (optional) | local:my-model |

The local provider uses an OpenAI-compatible endpoint. LOCAL_LLM_URL defaults to http://localhost:1234/v1 (LM Studio default).
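Parsing a provider string could be sketched as follows. The function name parseProvider and the error wording are hypothetical; the env var mapping is taken from the table above. Splitting on the first colon only is an assumption that keeps model names containing "/" or further colons intact.

```typescript
// Env var required per provider; null means no API key is needed.
const PROVIDER_ENV_VARS: Record<string, string | null> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_GENERATIVE_AI_API_KEY",
  xai: "XAI_API_KEY",
  openrouter: "OPENROUTER_API_KEY",
  local: null,
};

function parseProvider(spec: string): { provider: string; model: string; envVar: string | null } {
  const split = spec.indexOf(":");
  if (split <= 0 || split === spec.length - 1) {
    throw new Error(`Expected "provider:model", got "${spec}"`);
  }
  const provider = spec.slice(0, split);
  const model = spec.slice(split + 1);
  const supported = Object.keys(PROVIDER_ENV_VARS);
  if (supported.indexOf(provider) < 0) {
    // Wrong values suggest the available options.
    throw new Error(`Unknown provider "${provider}". Supported: ${supported.join(", ")}`);
  }
  return { provider, model, envVar: PROVIDER_ENV_VARS[provider] };
}
```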

Missing API keys produce clear, actionable errors:

Provider 'anthropic:claude-sonnet-4-20250514' requires ANTHROPIC_API_KEY.
Set it in your MCP server config env block or shell environment.

No provider has a default for LLM-graded assertions or dataset generation. You must specify grader_provider and provider explicitly — no silent API spending.

Provider resolution is powered by Vercel AI SDK — a thin abstraction layer over each provider's SDK. It handles retries, types, and multi-provider support in ~5 lines per provider. No orchestration frameworks, no hidden magic. If needed, each provider can be swapped to a raw SDK call without changing the tool interface.


Walkthrough: End-to-End Prompt Engineering

This walkthrough demonstrates the full iterative loop — from writing a prompt to shipping a validated version. Everything happens inside your MCP host (Claude Code, Cursor, Claude Desktop).

Step 1: Save Your Prompt

"Save this as 'meeting-action-items':

Extract action items from this meeting transcript. Return a JSON array
where each item has: owner, task, due_date (or null if not mentioned).

{{transcript}}"

Claude calls eval_suite with save_prompt. Variables auto-extracted: ["transcript"].

Step 2: Generate Test Cases

"Generate 5 test cases for meeting-action-items using openai:gpt-5-mini"

Claude calls eval_suite with generate_dataset. The LLM creates diverse cases:

  • Simple 1-on-1 meeting with clear action items
  • Multi-person standup with overlapping responsibilities
  • Vague meeting with no concrete actions
  • Conflicting assignments (same task, two owners)
  • Long rambling transcript with buried action items

Step 3: Run the First Eval

"Run meeting-action-items against the generated dataset using openai:gpt-5-mini and google:gemini-3.1-flash-lite-preview.
Assert: valid JSON, contains 'owner', and use llm-rubric 'All action items are accurately extracted with correct owners'.
Use openai:gpt-5-mini as the grader."

Result: GPT-5-mini: 80% pass rate. Gemini: 60%. Top failures:

  • Case 3 (vague meeting): Gemini hallucinated action items that weren't in the transcript
  • Case 4 (conflicting assignments): Both models assigned the task to only one person

Step 4: Fix the Prompt and Re-Run

"Update meeting-action-items to handle conflicts by noting both owners, and add an instruction
to output an empty array when no clear actions exist. Then re-run the same eval."

Claude calls eval_suite save_prompt with the checksum from v1, then eval_run with identical parameters.

Step 5: Compare Runs

"Compare the two runs"

Claude calls eval_analyze compare_runs:

GPT-5-mini: +0.15 (0.80 -> 0.95)
Gemini:     +0.35 (0.60 -> 0.95)
Regressions: none

Both providers now score 0.95. The prompt fix for conflict handling helped Gemini the most. No regressions. Ship it.

Step 6: Regression Check Later

A week later, you tweak the prompt for a new edge case.

"List recent runs for meeting-action-items, then run the same tests and compare against the last run"

Claude calls eval_analyze list_runs, then eval_run, then eval_analyze compare_runs. Instant confidence check before deploying the change.


Use Cases

Composability Matrix

| What you want | Tools used | Setup needed |
| --- | --- | --- |
| "Check this output" | eval_assert only | None |
| "Test my prompt" | eval_suite + eval_run | 30 seconds |
| "Compare v1 vs v2" | eval_run x2 + eval_analyze | Already have v1 |
| "Did I break anything?" | eval_analyze + eval_run + eval_analyze | Already have history |
| "Which model is best?" | eval_run (multiple providers) | Just a prompt + cases |
| "Generate test data" | eval_suite generate_dataset | Just a description |
| "Track quality over time" | eval_analyze trends | Already have runs |

Specific Scenarios

Prompt iteration for classification tasks. Save a classifier prompt, generate synthetic edge cases, run against 2-3 providers, read failures, fix the prompt, re-run, compare. The full loop in one conversation.

Model selection for production. Same prompt, same test suite, 3 providers. One eval_run call gives pass rates, scores, latency, and token cost per provider. Data-driven model choice instead of vibes.

Pre-deploy regression testing. Changed a prompt? Run the existing test suite, compare against the last known-good run. Zero regressions = safe to deploy.

Output validation in agent pipelines. Use eval_assert as a quality gate — check that an agent's output is valid JSON, contains required fields, passes a rubric. No eval infrastructure needed.

Synthetic test generation. Describe what your prompt does, get diverse test cases including adversarial inputs. Review and curate before running evals.


Database

Created lazily on first tool call (STDIO servers must start instantly — no slow init).

Default location: ~/.mcp-prompt-lab/prompt-lab.db

Override with PROMPT_LAB_DB:

"env": { "PROMPT_LAB_DB": "/custom/path/prompt-lab.db" }

Reset: rm -rf ~/.mcp-prompt-lab

Backup: cp ~/.mcp-prompt-lab/prompt-lab.db ~/backups/prompt-lab-$(date +%Y%m%d).db

Schema: 4 tables — prompts, datasets, runs, results. Deleting a run cascades to its results.

Pragmas:

  • journal_mode=WAL: safe concurrent reads from multiple processes
  • foreign_keys=ON: enforced referential integrity
  • user_version: schema migration tracking (a value of 0 means the full schema is created, after which the version is set to 1)

Engineering Decisions

Choices made deliberately, not by default.

| Decision | Resolution | Why |
| --- | --- | --- |
| 4 tools, not 15 | Group by user intent, not CRUD operation | Agents select tools better with fewer, well-scoped options. 4 is well under the 10-15 tool limit. |
| STDIO-only, no HTTP | Local-first: keys in env, data in SQLite | Eval data is sensitive (prompts, test cases, model outputs). Remote HTTP would need auth, sessions, key management. |
| Vercel AI SDK | Thin provider abstraction, not an orchestration framework | Unified generateText/generateObject across 6 providers. Can migrate to raw SDKs later; the tool interface doesn't change. |
| Explicit grader_provider | No default grader, no silent API calls | LLM-graded assertions cost tokens. The user must opt in per-call. |
| Checksum on mutations | SHA-256 of prompt content | Prevents overwriting prompts changed in another session. Error includes current checksum for easy recovery. |
| Token-aware summaries | Summary + top 3 failures, details via pagination | 150 results in a single response would blow context. Summary-first keeps results usable in any host. |
| Weighted scoring | weight on every assertion | A length-max check and an llm-rubric on accuracy aren't equally important. Weights let the score reflect actual quality criteria. |
| Global DB, not per-project | ~/.mcp-prompt-lab/ default, overridable via env | Eval data is cross-project by nature. You're comparing prompts, not building per-repo config. |
| Lazy DB init | Created on first tool call, not server start | STDIO servers must respond to initialize instantly. SQLite setup happens when actually needed. |
| Zod v4 for grader schemas | generateObject with typed schemas | Guaranteed structured grader responses. No regex parsing of raw LLM text. |
| Static hints, not generated | Hardcoded per action and outcome | Predictable, fast, zero tokens. Every response guides the next logical action. |
| Bun-native SQLite | bun:sqlite, zero npm dependencies for DB | One less dependency. Bun's SQLite is fast, synchronous, and supports WAL natively. |

License

MIT
