autoresearch-mcp

An MCP server that brings Andrej Karpathy's autoresearch pattern to every AI coding session, with a composable technique catalog, experiment scaffolding, and SQLite + FTS5-backed tracking.

What is Autoresearch?

Autoresearch is a simple but powerful pattern popularized by Andrej Karpathy's autoresearch project, one of the most-starred AI research repositories on GitHub. The loop is: give an AI agent a real experiment setup; let it modify code, prompts, or configs; run a fixed-time experiment; check whether the target metric improved; keep or discard the change; and repeat.

In Karpathy's framing, that can mean roughly 12 experiments per hour and around 100 overnight. The important idea is broader than any one implementation: if you have a measurable metric, you can ratchet toward better results.
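The throughput figures follow from a fixed time budget per experiment. As a rough check, the sketch below assumes a five-minute budget (implied by 12 experiments per hour) and an eight-hour night (an assumption made here for illustration, not a figure from the source). The helper name experiments_in is hypothetical.

```python
def experiments_in(hours: float, minutes_per_experiment: float) -> int:
    """Number of whole fixed-time experiments that fit in a time window."""
    return int(hours * 60 // minutes_per_experiment)


# Assumed budget: 5 minutes per experiment.
per_hour = experiments_in(1, 5)    # 12
# Assumed window: 8 hours overnight.
overnight = experiments_in(8, 5)   # 96, i.e. "around 100"
```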

autoresearch-mcp packages that pattern as an MCP server so any compatible AI client can discover techniques, scaffold experiments, track iterations, and accumulate meta-learning across projects.

This project is inspired by Karpathy's work, but it is not affiliated with his project and no code was copied.

Quick Start

Runtime requirements:

  • MCP server (autoresearch-mcp): requires Bun, because the server uses bun:sqlite
  • Skill installer (autoresearch-install-skill): requires Node.js >= 20.19, no Bun needed
  • Any MCP-compatible client such as Claude Code or OpenCode

Install globally (puts both commands on your PATH):

npm install -g autoresearch-mcp

Or run without a global install:

# Start the MCP server (requires Bun)
bunx autoresearch-mcp

# Install the bundled skill (works with Node alone)
npx -p autoresearch-mcp autoresearch-install-skill

Note: npm install -g succeeds on a machine without Bun, but the autoresearch-mcp server command will not run until Bun is installed. The skill installer runs on Node alone.

Install as Skill (Recommended)

autoresearch-mcp ships with a skill file that teaches your AI agent the autoresearch methodology: when to use which technique, how to compose recipes, and how to run ratchet loops. The skill is lightweight (~100-400 tokens in context) while the MCP server provides the heavy machinery (catalog search, experiment tracking, scaffolding).

Skill + MCP = Brain + Hands

OpenCode

# Install the bundled skill into ~/.opencode/skills/autoresearch
npx -p autoresearch-mcp autoresearch-install-skill --target opencode

# If autoresearch-mcp is already installed globally (requires Bun;
# on Node-only machines use autoresearch-install-skill instead):
autoresearch-mcp install-skill --target opencode

Skills are auto-discovered from ~/.opencode/skills/. The skill lazy-loads when your agent encounters optimization problems.

Claude Code

npx -p autoresearch-mcp autoresearch-install-skill --target claude

pi.dev

pi --skill "$(npm root -g)/autoresearch-mcp/skills/autoresearch/SKILL.md"

The installer copies skill files by default so npx temporary package caches do not leave broken symlinks. Use --dry-run to preview changes or --overwrite to replace an existing skill directory.

Install as MCP Server (Machinery)

The MCP server provides tools and state. Install alongside the skill for full capability.

Claude Code

Add this to your MCP settings:

{
  "mcpServers": {
    "autoresearch": {
      "command": "bunx",
      "args": ["autoresearch-mcp"]
    }
  }
}

OpenCode

Add this to ~/.config/opencode/opencode.json:

{
  "mcp": {
    "autoresearch": {
      "type": "local",
      "command": ["bunx", "autoresearch-mcp"]
    }
  }
}

Once connected, ask your agent things like:

  • "What autoresearch technique should I use for prompt optimization?"
  • "Scaffold a code-performance experiment for this project."
  • "Log this iteration result and track the total cost."

How It Works

At the core is a ratchet loop:

Edit artifact -> Run evaluator -> Score improved? -> Yes: Keep -> Repeat
                                                  -> No: Revert -> Repeat
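
The loop above can be sketched in a few lines of Python. This is a minimal illustration of the keep-or-revert idea, not the server's implementation: the function ratchet, its parameters, and the toy propose and evaluate functions are all hypothetical names introduced here.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def ratchet(
    artifact: T,
    propose: Callable[[T, random.Random], T],
    evaluate: Callable[[T], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[T, float, List[Tuple[int, float, bool]]]:
    """Keep a candidate only when it strictly beats the best score so far."""
    rng = random.Random(seed)
    best, best_score = artifact, evaluate(artifact)
    history: List[Tuple[int, float, bool]] = []
    for i in range(1, iterations + 1):
        candidate = propose(best, rng)   # edit artifact
        score = evaluate(candidate)      # run evaluator
        improved = score > best_score    # score improved?
        if improved:
            best, best_score = candidate, score  # yes: keep
        # no: "revert" by continuing from the unchanged best
        history.append((i, score, improved))
    return best, best_score, history


# Toy usage (illustrative only): nudge a number toward 3.0.
def propose(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-1.0, 1.0)


def evaluate(x: float) -> float:
    return -(x - 3.0) ** 2  # higher is better, maximum at x == 3.0


best, best_score, history = ratchet(0.0, propose, evaluate, iterations=200)
```

Because a candidate is kept only on strict improvement, the best score can never decrease, which is what makes the loop a ratchet.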

The server gives your agent the pieces needed to run that loop in a structured way:

  1. Pick a technique or recipe.
  2. Scaffold an experiment with a program and evaluator harness.
  3. Run iterations against a measurable metric.
  4. Keep improvements, discard regressions.
  5. Track costs, timing, and outcomes.
  6. Reuse what works across future projects.

Technique Catalog

autoresearch-mcp ships with a 30-item catalog organized into four composable layers.

1. Search Strategies

These define how candidate changes are proposed.

  • hill-climbing
  • evolutionary
  • bayesian-optimization
  • beam-search
  • multi-armed-bandit
  • simulated-annealing
  • ablation-elimination
  • self-refine
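
As one illustration of how a strategy can steer proposals, here is a hedged sketch of an epsilon-greedy multi-armed bandit that chooses which kind of change to try next based on how often each kind has improved the score. It is a generic textbook form, not the catalog's definition of multi-armed-bandit; the class EpsilonGreedy and the arm names are hypothetical.

```python
import random
from typing import Dict, Iterable


class EpsilonGreedy:
    """Choose among named change types, favouring those that improved the score."""

    def __init__(self, arms: Iterable[str], epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls: Dict[str, int] = {arm: 0 for arm in arms}
        self.wins: Dict[str, int] = {arm: 0 for arm in self.pulls}

    def choose(self) -> str:
        # Try every arm once before comparing win rates.
        untried = [arm for arm, n in self.pulls.items() if n == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best win rate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=lambda arm: self.wins[arm] / self.pulls[arm])

    def record(self, arm: str, improved: bool) -> None:
        self.pulls[arm] += 1
        self.wins[arm] += int(improved)


# Hypothetical arms: kinds of prompt edits an agent might attempt.
bandit = EpsilonGreedy(["shorten", "add-example", "reorder"])
arm = bandit.choose()
bandit.record(arm, improved=True)
```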

2. Evaluators

These define how candidates are scored.

  • benchmark-harness
  • binary-evaluator
  • rubric-scorer
  • llm-as-judge
  • pairwise-comparison
  • cost-latency-evaluator
  • human-approval-gate
  • regression-detector

3. Execution Patterns

These define how the loop is run and controlled.

  • single-ratchet
  • two-loop
  • bounded-episode
  • branch-and-merge
  • champion-challenger
  • checkpoint-and-resume

4. Recipes

Recipes compose a strategy, evaluator, and execution pattern into a ready-to-use starting point.

  • prompt-optimization
  • code-performance
  • config-tuning
  • content-revision
  • test-amplification
  • ml-training
  • literature-synthesis
  • general-ratchet

MCP Tools

The server exposes 12 MCP tools.

| Tool | Description |
| --- | --- |
| search_techniques | Search the catalog by query, or list all techniques when the query is empty. |
| get_technique | Return full details for a technique by ID. |
| suggest_technique | Describe a problem and get a recommended approach. |
| register_experiment | Create a tracked experiment record. |
| update_experiment | Update experiment status with automatic timestamps. |
| log_result | Log an iteration result with score, time, token, and dollar tracking. |
| get_experiment | Retrieve experiment details and optional iteration history. |
| list_experiments | List experiments filtered by status or project. |
| scaffold_experiment | Generate program.md, eval.sh, and results.tsv from a recipe. |
| get_template | Return a recipe template file. |
| get_server_info | Return server version, catalog stats, and the active database path. |
| log_technique_outcome | Record what worked for cross-project meta-learning. |
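
To make the tracking tools more concrete, the sketch below shows one way experiment and iteration records could be stored in SQLite. It is an assumption-laden illustration: the schema, column names, and the local Python helpers (which borrow the tool names register_experiment and log_result) are hypothetical, and it uses Python's sqlite3 module rather than the bun:sqlite driver the server runs on.

```python
import sqlite3

# Hypothetical schema; the server's real tables may differ.
SCHEMA = """
CREATE TABLE experiments (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'registered'
);
CREATE TABLE results (
    experiment_id      INTEGER NOT NULL REFERENCES experiments(id),
    iteration          INTEGER NOT NULL,
    score              REAL    NOT NULL,
    improved           INTEGER NOT NULL,
    seconds            REAL    NOT NULL DEFAULT 0,
    tokens             INTEGER NOT NULL DEFAULT 0,
    dollars            REAL    NOT NULL DEFAULT 0,
    change_description TEXT,
    PRIMARY KEY (experiment_id, iteration)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def register_experiment(db: sqlite3.Connection, name: str) -> int:
    cur = db.execute("INSERT INTO experiments (name) VALUES (?)", (name,))
    db.commit()
    return cur.lastrowid


def log_result(db, experiment_id, iteration, score, improved,
               seconds=0.0, tokens=0, dollars=0.0, change_description=""):
    db.execute(
        "INSERT INTO results (experiment_id, iteration, score, improved,"
        " seconds, tokens, dollars, change_description)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (experiment_id, iteration, score, int(improved),
         seconds, tokens, dollars, change_description),
    )
    db.commit()


def summary(db: sqlite3.Connection, experiment_id: int) -> dict:
    """Iteration count, best score, and total dollar cost for one experiment."""
    count, best, cost = db.execute(
        "SELECT COUNT(*), MAX(score), COALESCE(SUM(dollars), 0)"
        " FROM results WHERE experiment_id = ?",
        (experiment_id,),
    ).fetchone()
    return {"iterations": count, "best_score": best, "total_dollars": cost}


db = open_db()
exp_id = register_experiment(db, "code-performance")
log_result(db, exp_id, 1, 1180, True, change_description="inlined hot path")
log_result(db, exp_id, 2, 1165, False, change_description="added extra serialization")
stats = summary(db, exp_id)
```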

Usage Examples

Conversational workflow

You: "I want to optimize my chatbot's system prompt. I have 50 test questions."

Agent calls: suggest_technique(problem: "optimize chatbot prompt with eval set")
-> Recommends: prompt-optimization recipe
   (hill-climbing + llm-as-judge + single-ratchet)

Agent calls: scaffold_experiment(recipe_id: "prompt-optimization", ...)
-> Creates: autoresearch/program.md, eval.sh, results.tsv

You: "Run the ratchet loop"
Agent: reads program.md, edits prompt, runs eval.sh, logs results...

After 10 iterations: Score improved from 62 to 94 (+52%)

Scripted MCP flow

If you prefer explicit tool orchestration, the lifecycle looks like this:

1. suggest_technique(problem="reduce API latency without hurting quality")
2. scaffold_experiment(recipe_id="code-performance", project_path="/repo", metric_name="requests/sec")
3. update_experiment(experiment_id="...", status="running")
4. log_result(iteration=1, score=1180, improved=true, change_description="inlined hot path")
5. log_result(iteration=2, score=1165, improved=false, change_description="added extra serialization")
6. get_experiment(experiment_id="...", include_results=true)
7. log_technique_outcome(technique_id="code-performance", domain="backend", outcome="success")

After scaffolding, your agent gets a working starting point:

  • autoresearch/program.md for the loop instructions
  • autoresearch/eval.sh for the evaluation harness
  • autoresearch/results.tsv for iteration history
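
As a rough picture of the iteration history, the sketch below appends rows to a tab-separated log and reads back the best kept iteration. The column layout is an assumption borrowed from the log_result arguments shown above; the real results.tsv produced by scaffold_experiment may use different columns. The helpers append_row and best_kept are hypothetical, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Assumed columns, mirroring the log_result arguments in the example flow.
FIELDS = ["iteration", "score", "improved", "change_description"]


def append_row(buf: io.StringIO, row: dict) -> None:
    """Append one iteration to a tab-separated log."""
    writer = csv.DictWriter(buf, FIELDS, delimiter="\t", lineterminator="\n")
    writer.writerow(row)


def best_kept(tsv_text: str):
    """Return the kept (improved) row with the highest score, or None."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["improved"] == "true"]
    return max(kept, key=lambda r: float(r["score"]), default=None)


buf = io.StringIO()
buf.write("\t".join(FIELDS) + "\n")  # header line
append_row(buf, {"iteration": 1, "score": 1180, "improved": "true",
                 "change_description": "inlined hot path"})
append_row(buf, {"iteration": 2, "score": 1165, "improved": "false",
                 "change_description": "added extra serialization"})

best = best_kept(buf.getvalue())
```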

Example Domains

Anything with a measurable target can use the pattern.

| Domain | Target | Evaluator Example |
| --- | --- | --- |
| Prompt engineering | System prompts | Eval set accuracy |
| Code performance | Source code | Benchmark score |
| Config tuning | Config files | Performance metric |
| Content quality | Articles and docs | Quality rubric score |
| Test coverage | Test suites | Coverage percentage |
| ML training | Training code | Validation loss |

Recipes

Recipes are the fastest way to get started because they encode a practical composition of the three lower layers:

recipe = search strategy + evaluator + execution pattern

For example:

  • prompt-optimization combines a search strategy suited to prompt mutation, an evaluator that can score prompt outputs, and a ratchet pattern that preserves improvements.
  • code-performance pairs code changes with benchmark-driven evaluation.
  • general-ratchet gives you a flexible default when your domain is unusual but still measurable.

You can use recipes as-is, inspect their parts with get_technique, or search the catalog to build your own combination.
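
The composition rule can be modeled as a small data structure that checks each part against its layer. The layer contents below are copied from the catalog lists, and the prompt-optimization composition comes from the conversational example; the Recipe class itself is a hypothetical sketch, not the server's data model.

```python
from dataclasses import dataclass

SEARCH_STRATEGIES = {
    "hill-climbing", "evolutionary", "bayesian-optimization", "beam-search",
    "multi-armed-bandit", "simulated-annealing", "ablation-elimination",
    "self-refine",
}
EVALUATORS = {
    "benchmark-harness", "binary-evaluator", "rubric-scorer", "llm-as-judge",
    "pairwise-comparison", "cost-latency-evaluator", "human-approval-gate",
    "regression-detector",
}
EXECUTION_PATTERNS = {
    "single-ratchet", "two-loop", "bounded-episode", "branch-and-merge",
    "champion-challenger", "checkpoint-and-resume",
}


@dataclass(frozen=True)
class Recipe:
    """recipe = search strategy + evaluator + execution pattern."""

    recipe_id: str
    strategy: str
    evaluator: str
    pattern: str

    def __post_init__(self) -> None:
        # Reject parts that do not belong to the expected catalog layer.
        checks = (
            (self.strategy, "search strategy", SEARCH_STRATEGIES),
            (self.evaluator, "evaluator", EVALUATORS),
            (self.pattern, "execution pattern", EXECUTION_PATTERNS),
        )
        for value, layer, allowed in checks:
            if value not in allowed:
                raise ValueError(f"unknown {layer}: {value!r}")


prompt_optimization = Recipe(
    recipe_id="prompt-optimization",
    strategy="hill-climbing",
    evaluator="llm-as-judge",
    pattern="single-ratchet",
)
```

Validating at construction time means a custom combination fails early if a part is taken from the wrong layer.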

Configuration

Claude Code

{
  "mcpServers": {
    "autoresearch": {
      "command": "bunx",
      "args": ["autoresearch-mcp"]
    }
  }
}

OpenCode

{
  "mcp": {
    "autoresearch": {
      "type": "local",
      "command": ["bunx", "autoresearch-mcp"]
    }
  }
}

Other MCP clients

Any client that supports launching a local stdio MCP server can use autoresearch-mcp with the same pattern:

  • command: bunx
  • args: autoresearch-mcp

If your client expects a single command string rather than a separate command and args, use the full invocation: bunx autoresearch-mcp.

Roadmap

  • Phase 0.5: Catalog discovery + FTS5 search
  • Phase 1: Experiment tracking + scaffolding
  • Phase 2: Skill + tests + public release (current)
  • Phase 3: Autonomous runner with agent-driven execution and approval-aware loops
  • Phase 4: Docker sandbox for safer code execution and isolated experiments
  • Phase 5: Nightcrawler-style bounded episodes for longer autonomous optimization runs

The direction is simple: start with trustworthy building blocks, then expand toward increasingly autonomous experiment execution.

Inspired By

This project is directly inspired by Andrej Karpathy's autoresearch project.

autoresearch-mcp adapts the underlying pattern for MCP-native workflows so coding agents can use it across prompts, code, configs, content, tests, and research tasks.

It is inspired by Karpathy's idea, not affiliated with his project, and no code was copied.

Contributing

Contributions are welcome. Please see CONTRIBUTING.md.

License

Apache-2.0
