autoresearch-mcp
An MCP server that brings Andrej Karpathy's autoresearch pattern for iterative experimentation to every AI coding session, with a composable technique catalog, experiment scaffolding, and SQLite + FTS5-backed tracking for AI-assisted optimization loops.
What is Autoresearch?
Autoresearch is a simple but powerful pattern popularized by Andrej Karpathy's autoresearch project, one of the most-starred AI research repositories on GitHub. The loop is: give an AI agent a real experiment setup, let it modify code, prompts, or configs, run a fixed-time experiment, check whether the target metric improved, keep or discard the change, and repeat.
In Karpathy's framing, that can mean roughly 12 experiments per hour (about five minutes each) and around 100 overnight. The important idea is broader than any one implementation: if you have a measurable metric, you can ratchet toward better results.
autoresearch-mcp packages that pattern as an MCP server so any compatible AI client can discover techniques, scaffold experiments, track iterations, and accumulate meta-learning across projects.
This project is inspired by Karpathy's work, but it is not affiliated with his project and no code was copied.
Quick Start
Runtime requirements:
- MCP server (autoresearch-mcp): requires Bun, because the server uses bun:sqlite
- Skill installer (autoresearch-install-skill): requires Node.js >= 20.19, no Bun needed
- Any MCP-compatible client such as Claude Code or OpenCode
Install globally (puts both commands on your PATH):
npm install -g autoresearch-mcp
Or run without a global install:
# Start the MCP server (requires Bun)
bunx autoresearch-mcp
# Install the bundled skill (works with Node alone)
npx -p autoresearch-mcp autoresearch-install-skill
Note: npm install -g succeeds on a machine without Bun, but the autoresearch-mcp server command will not run until Bun is installed. The skill installer runs on Node alone.
Install as Skill (Recommended)
autoresearch-mcp ships with a skill file that teaches your AI agent the autoresearch methodology: when to use which technique, how to compose recipes, and how to run ratchet loops. The skill is lightweight (~100-400 tokens in context) while the MCP server provides the heavy machinery (catalog search, experiment tracking, scaffolding).
Skill + MCP = Brain + Hands
OpenCode
# Install the bundled skill into ~/.opencode/skills/autoresearch
npx -p autoresearch-mcp autoresearch-install-skill --target opencode
# If autoresearch-mcp is already installed globally (requires Bun;
# on Node-only machines use autoresearch-install-skill instead):
autoresearch-mcp install-skill --target opencode
Skills are auto-discovered from ~/.opencode/skills/. The skill lazy-loads when your agent encounters optimization problems.
Claude Code
npx -p autoresearch-mcp autoresearch-install-skill --target claude
pi.dev
pi --skill $(npm root -g)/autoresearch-mcp/skills/autoresearch/SKILL.md
The installer copies skill files by default so npx temporary package caches do not leave broken symlinks. Use --dry-run to preview changes or --overwrite to replace an existing skill directory.
Install as MCP Server (Machinery)
The MCP server provides the tools and the persistent state. Install it alongside the skill for full capability.
Claude Code
Add this to your MCP settings:
{
"mcpServers": {
"autoresearch": {
"command": "bunx",
"args": ["autoresearch-mcp"]
}
}
}
OpenCode
Add this to ~/.config/opencode/opencode.json:
{
"mcp": {
"autoresearch": {
"type": "local",
"command": ["bunx", "autoresearch-mcp"]
}
}
}
Once connected, ask your agent things like:
- "What autoresearch technique should I use for prompt optimization?"
- "Scaffold a code-performance experiment for this project."
- "Log this iteration result and track the total cost."
How It Works
At the core is a ratchet loop:
Edit artifact -> Run evaluator -> Score improved? -> Yes: Keep -> Repeat
-> No: Revert -> Repeat
The server gives your agent the pieces needed to run that loop in a structured way:
- Pick a technique or recipe.
- Scaffold an experiment with a program and evaluator harness.
- Run iterations against a measurable metric.
- Keep improvements, discard regressions.
- Track costs, timing, and outcomes.
- Reuse what works across future projects.
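The keep-or-revert loop above can be sketched in a few lines. This is a minimal illustration, not the server's implementation: the names `ratchet`, `propose`, and `evaluate` are hypothetical, the "artifact" is a single number, and the sketch assumes a metric where higher is better.

```python
import random
from typing import Callable, List, Tuple


def ratchet(
    initial: float,
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    iterations: int,
) -> Tuple[float, float, List[Tuple[int, float, bool]]]:
    """Keep a candidate only when its score beats the best score so far."""
    best, best_score = initial, evaluate(initial)
    history = []  # (iteration, score, improved) rows, similar to a results log
    for i in range(1, iterations + 1):
        candidate = propose(best)        # edit the artifact
        score = evaluate(candidate)      # run the evaluator
        improved = score > best_score    # assumption: higher is better
        if improved:
            best, best_score = candidate, score  # keep the change
        # Otherwise the candidate is dropped, which is the "revert" branch.
        history.append((i, score, improved))
    return best, best_score, history


# Toy problem (hypothetical): the metric peaks when the artifact equals 3.0.
rng = random.Random(0)
best, best_score, history = ratchet(
    initial=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: -(x - 3.0) ** 2,
    iterations=50,
)
kept = sum(1 for _, _, improved in history if improved)
print(f"best={best:.3f} score={best_score:.4f} kept={kept}/50")
```

Because a change is kept only when it strictly improves the score, the best score can never get worse from one iteration to the next, which is what makes the loop a ratchet.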
Technique Catalog
autoresearch-mcp ships with a 30-item catalog organized into four composable layers.
1. Search Strategies
These define how candidate changes are proposed.
- hill-climbing
- evolutionary
- bayesian-optimization
- beam-search
- multi-armed-bandit
- simulated-annealing
- ablation-elimination
- self-refine
2. Evaluators
These define how candidates are scored.
- benchmark-harness
- binary-evaluator
- rubric-scorer
- llm-as-judge
- pairwise-comparison
- cost-latency-evaluator
- human-approval-gate
- regression-detector
3. Execution Patterns
These define how the loop is run and controlled.
- single-ratchet
- two-loop
- bounded-episode
- branch-and-merge
- champion-challenger
- checkpoint-and-resume
4. Recipes
Recipes compose a strategy, evaluator, and execution pattern into a ready-to-use starting point.
- prompt-optimization
- code-performance
- config-tuning
- content-revision
- test-amplification
- ml-training
- literature-synthesis
- general-ratchet
MCP Tools
The server exposes 12 MCP tools.
| Tool | Description |
|---|---|
| search_techniques | Search the catalog by query, or list all techniques when the query is empty. |
| get_technique | Return full details for a technique by ID. |
| suggest_technique | Describe a problem and get a recommended approach. |
| register_experiment | Create a tracked experiment record. |
| update_experiment | Update experiment status with automatic timestamps. |
| log_result | Log an iteration result with score, time, token, and dollar tracking. |
| get_experiment | Retrieve experiment details and optional iteration history. |
| list_experiments | List experiments filtered by status or project. |
| scaffold_experiment | Generate program.md, eval.sh, and results.tsv from a recipe. |
| get_template | Return a recipe template file. |
| get_server_info | Return server version, catalog stats, and the active database path. |
| log_technique_outcome | Record what worked for cross-project meta-learning. |
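To give a feel for the behavior described for search_techniques (match by query, or list everything when the query is empty), here is a rough sketch of an FTS5-backed catalog search. It is written in Python with the standard `sqlite3` module purely for illustration, whereas the actual server uses bun:sqlite. The table layout, the summary texts, and the function body are assumptions, not the project's schema, and the sketch assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 virtual table holding the catalog entries.
conn.execute("CREATE VIRTUAL TABLE techniques USING fts5(id, layer, summary)")
conn.executemany(
    "INSERT INTO techniques VALUES (?, ?, ?)",
    [
        # IDs come from the catalog above; the summaries are made up here.
        ("hill-climbing", "strategy",
         "Propose one small change at a time and keep it if the score improves"),
        ("llm-as-judge", "evaluator",
         "Score candidate outputs with a language model"),
        ("single-ratchet", "execution",
         "Run one loop that keeps improvements and discards regressions"),
    ],
)


def search_techniques(query: str) -> list:
    """Return matching technique IDs, or every ID when the query is empty."""
    if not query.strip():
        rows = conn.execute("SELECT id FROM techniques ORDER BY id")
        return [row[0] for row in rows]
    # Quote each term so hyphens are not parsed as FTS5 query syntax.
    match = " ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    rows = conn.execute(
        "SELECT id FROM techniques WHERE techniques MATCH ? ORDER BY rank",
        (match,),
    )
    return [row[0] for row in rows]


print(search_techniques("language model"))
print(search_techniques(""))
```

Quoting each term is a small design choice worth copying: an unquoted query such as hill-climbing would otherwise be interpreted as FTS5 syntax rather than as a search phrase.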
Usage Examples
Conversational workflow
You: "I want to optimize my chatbot's system prompt. I have 50 test questions."
Agent calls: suggest_technique(problem: "optimize chatbot prompt with eval set")
-> Recommends: prompt-optimization recipe
(hill-climbing + llm-as-judge + single-ratchet)
Agent calls: scaffold_experiment(recipe_id: "prompt-optimization", ...)
-> Creates: autoresearch/program.md, eval.sh, results.tsv
You: "Run the ratchet loop"
Agent: reads program.md, edits prompt, runs eval.sh, logs results...
After 10 iterations: Score improved from 62 to 94 (+52%)
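The +52% figure in this example is the relative gain over the starting score, not the 32-point absolute difference. A quick check of the arithmetic, using only the two scores quoted above:

```python
baseline, final = 62, 94

absolute_gain = final - baseline             # 32 points
relative_gain = absolute_gain / baseline     # 32 / 62, about 0.516

print(f"{absolute_gain} points, {relative_gain:+.0%}")  # → 32 points, +52%
```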
Scripted MCP flow
If you prefer explicit tool orchestration, the lifecycle looks like this:
1. suggest_technique(problem="reduce API latency without hurting quality")
2. scaffold_experiment(recipe_id="code-performance", project_path="/repo", metric_name="requests/sec")
3. update_experiment(experiment_id="...", status="running")
4. log_result(iteration=1, score=1180, improved=true, change_description="inlined hot path")
5. log_result(iteration=2, score=1165, improved=false, change_description="added extra serialization")
6. get_experiment(experiment_id="...", include_results=true)
7. log_technique_outcome(technique_id="code-performance", domain="backend", outcome="success")
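On the wire, each step in this flow is an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below only builds the request payloads for three of the steps, using the argument names shown in the list above; the helper `tool_call` is hypothetical. A real session would also need the MCP initialize handshake first, which most clients handle for you.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request IDs must be unique within a session


def tool_call(name: str, **arguments) -> str:
    """Serialize one MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


requests = [
    tool_call("suggest_technique",
              problem="reduce API latency without hurting quality"),
    tool_call("scaffold_experiment",
              recipe_id="code-performance",
              project_path="/repo",
              metric_name="requests/sec"),
    tool_call("log_result",
              iteration=1, score=1180, improved=True,
              change_description="inlined hot path"),
]

# Over the stdio transport, each message is sent as one line of JSON.
for line in requests:
    print(line)
```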
After scaffolding, your agent gets a working starting point:
- autoresearch/program.md for the loop instructions
- autoresearch/eval.sh for the evaluation harness
- autoresearch/results.tsv for iteration history
Example Domains
Anything with a measurable target can use the pattern.
| Domain | Target | Evaluator Example |
|---|---|---|
| Prompt engineering | System prompts | Eval set accuracy |
| Code performance | Source code | Benchmark score |
| Config tuning | Config files | Performance metric |
| Content quality | Articles and docs | Quality rubric score |
| Test coverage | Test suites | Coverage percentage |
| ML training | Training code | Validation loss |
Recipes
Recipes are the fastest way to get started because they encode a practical composition of the three lower layers:
recipe = search strategy + evaluator + execution pattern
For example:
- prompt-optimization combines a search strategy suited to prompt mutation, an evaluator that can score prompt outputs, and a ratchet pattern that preserves improvements.
- code-performance pairs code changes with benchmark-driven evaluation.
- general-ratchet gives you a flexible default when your domain is unusual but still measurable.
You can use recipes as-is, inspect their parts with get_technique, or search the catalog to build your own combination.
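One way to picture the composition is as a small record with one slot per layer. The `Recipe` class below is a hypothetical illustration, not the server's data model. The prompt-optimization composition matches the one shown in the conversational workflow example (hill-climbing + llm-as-judge + single-ratchet), while the pairwise variant is invented here to show how a single layer could be swapped.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Recipe:
    """recipe = search strategy + evaluator + execution pattern"""
    id: str
    strategy: str
    evaluator: str
    execution: str

    def parts(self) -> List[str]:
        return [self.strategy, self.evaluator, self.execution]


prompt_optimization = Recipe(
    id="prompt-optimization",
    strategy="hill-climbing",
    evaluator="llm-as-judge",
    execution="single-ratchet",
)

# Hypothetical custom combination: same strategy and loop, different evaluator.
pairwise_variant = replace(
    prompt_optimization,
    id="prompt-optimization-pairwise",
    evaluator="pairwise-comparison",
)

print(" + ".join(prompt_optimization.parts()))
print(" + ".join(pairwise_variant.parts()))
```

Keeping the layers as separate fields is what makes them composable: any catalog ID from one layer can be swapped without touching the other two.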
Configuration
Claude Code
{
"mcpServers": {
"autoresearch": {
"command": "bunx",
"args": ["autoresearch-mcp"]
}
}
}
OpenCode
{
"mcp": {
"autoresearch": {
"type": "local",
"command": ["bunx", "autoresearch-mcp"]
}
}
}
Other MCP clients
Any client that supports launching a local stdio MCP server can use autoresearch-mcp with the same pattern:
- command: bunx
- args: autoresearch-mcp
If your client expects a single executable command, point it at the same Bun-based invocation.
Roadmap
- Phase 0.5: Catalog discovery + FTS5 search
- Phase 1: Experiment tracking + scaffolding
- Phase 2: Skill + tests + public release (current)
- Phase 3: Autonomous runner with agent-driven execution and approval-aware loops
- Phase 4: Docker sandbox for safer code execution and isolated experiments
- Phase 5: Nightcrawler-style bounded episodes for longer autonomous optimization runs
The direction is simple: start with trustworthy building blocks, then expand toward increasingly autonomous experiment execution.
Inspired By
This project is directly inspired by Andrej Karpathy's autoresearch work:
- GitHub: karpathy/autoresearch
- Posts and discussion: @karpathy on X
autoresearch-mcp adapts the underlying pattern for MCP-native workflows so coding agents can use it across prompts, code, configs, content, tests, and research tasks.
It is inspired by Karpathy's idea, not affiliated with his project, and no code was copied.
Contributing
Contributions are welcome. Please see CONTRIBUTING.md.
License