Agent Lab

Run and test agentic systems in isolated Docker sandboxes, varying system prompts, models, and task prompts while capturing full behavior traces via MCP tools.

Run and test agentic systems in isolation. Agent Lab runs OpenCode in a Docker sandbox ("vacuum") with controlled settings and lets you observe how an agent behaves under varied system prompts, models, and task prompts — one run or many in parallel. It is built primarily to be called by agents (over MCP), and secondarily by humans (CLI).

  • Vary system prompt / model / task prompt; run isolated, capture the full behavior trace.
  • Two interfaces over one engine: MCP (stdio) and CLI — both agent-friendly.
  • Two network modes (open and vacuum) and guaranteed sandbox teardown.

Prerequisites

  1. Bun 1.x — bun --version
  2. Docker running — docker --version
  3. OpenCode configured on the host — a provider set up in ~/.config/opencode (auth in ~/.local/share/opencode). These are mounted read-only into each sandbox; nothing is baked into the image.

Install

Pick one. All three give you the agent-lab (CLI) and agent-lab-mcp (MCP server) commands. Docker (or the microsandbox runtime) and the sandbox image are separate prerequisites — see below.

npm (needs Node ≥ 22):

npm install -g agent-lab-opencode
# or run without installing:  npx -y agent-lab-opencode-mcp

Standalone binary (no Node/Bun required) — download for your platform from the latest release, e.g.:

curl -fsSL -o agent-lab https://github.com/ShutovKS/agent-lab-opencode/releases/latest/download/agent-lab-darwin-arm64
chmod +x agent-lab

From source (Bun):

bun install
bun link                                             # exposes `agent-lab` + `agent-lab-mcp` on PATH

Get the sandbox image (opencode serve) — either pull the published multi-arch image:

docker pull ghcr.io/shutovks/agent-lab-opencode:latest
docker tag ghcr.io/shutovks/agent-lab-opencode:latest agent-lab-opencode:latest

…or build it locally:

docker build -t agent-lab-opencode:latest docker/

The engine, CLI, and MCP server all run on the host (where Docker + your OpenCode config live). Experiments run inside isolated containers. Runs are persisted under runs/<runId>/ relative to the working directory the server/CLI is launched from.

Use from an agent — MCP (recommended)

Agent Lab exposes an MCP stdio server with four tools:

| Tool | Arguments | Returns |
| --- | --- | --- |
| run_experiment | systemPrompt, model, taskPrompt, image?, networkAllowlist?, networkMode?, timeoutMs?, concurrency? | runId + status |
| list_runs | (none) | known runs |
| get_run | runId | full run record + trace (steps, tool calls, tokens, output, git diff) |
| compare_runs | runIds[] (≥2) | structural behavior diff vs. the first (baseline) |
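
For clients that speak MCP without a helper library, the sketch below shows roughly what a run_experiment call looks like on the wire. The `tools/call` method and the `params.name` / `params.arguments` envelope come from the MCP specification, and the argument names come from the table above. The helper names (`tool_call`, `run_experiment_request`) are hypothetical, and a real session must first complete the MCP initialize handshake, which client libraries normally handle.

```python
from itertools import count

_request_ids = count(1)  # monotonically increasing JSON-RPC request ids

# Optional run_experiment arguments, as listed in the tool table.
OPTIONAL_ARGS = {"image", "networkAllowlist", "networkMode", "timeoutMs", "concurrency"}


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def run_experiment_request(system_prompt: str, model: str, task_prompt: str, **optional) -> dict:
    """Build a run_experiment request; unknown optional keys are rejected early."""
    unknown = set(optional) - OPTIONAL_ARGS
    if unknown:
        raise ValueError(f"unknown optional arguments: {sorted(unknown)}")
    arguments = {
        "systemPrompt": system_prompt,
        "model": model,
        "taskPrompt": task_prompt,
    }
    # Only send optional arguments that were actually set.
    arguments.update({k: v for k, v in optional.items() if v is not None})
    return tool_call("run_experiment", arguments)
```

Serializing the returned dict with `json.dumps` and writing it as one line to the server's stdin is how the stdio transport frames a message.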

Claude Code

This repo ships a .mcp.json, so opening the project in Claude Code registers the server automatically. To use it from any project after bun link:

{
  "mcpServers": {
    "agent-lab": {
      "command": "agent-lab-mcp"
    }
  }
}

OpenCode

In opencode.json (or ~/.config/opencode/opencode.jsonc):

{
  "mcp": {
    "agent-lab": {
      "type": "local",
      "command": ["agent-lab-mcp"]
    }
  }
}

Typical agent flow

  1. run_experiment with prompt variant A → runId_A
  2. run_experiment with prompt variant B → runId_B
  3. compare_runs [runId_A, runId_B] → see which variant used fewer steps/tokens or a different tool sequence. Results come back as text and structuredContent (machine-readable).
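
To make the idea of a structural diff against a baseline concrete, here is a minimal sketch of the kind of comparison step 3 describes. It is not the compare_runs implementation. The trace shape used here (a `steps` list whose entries carry `type` and `tool`, plus `tokenUsage.total`) is an assumption made for illustration, and the real structuredContent may be laid out differently.

```python
from difflib import SequenceMatcher


def tool_sequence(trace: dict) -> list[str]:
    """Ordered tool names from a trace (assumed shape: steps[].type / steps[].tool)."""
    return [s["tool"] for s in trace["steps"] if s.get("type") == "tool_call"]


def diff_against_baseline(baseline: dict, other: dict) -> dict:
    """Compare step count, token total, and tool order of `other` with `baseline`."""
    base_tools = tool_sequence(baseline)
    other_tools = tool_sequence(other)
    matcher = SequenceMatcher(a=base_tools, b=other_tools, autojunk=False)
    # Keep only the non-matching regions: (operation, baseline slice, other slice).
    changes = [
        (tag, base_tools[i1:i2], other_tools[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return {
        "stepDelta": len(other["steps"]) - len(baseline["steps"]),
        "tokenDelta": other["tokenUsage"]["total"] - baseline["tokenUsage"]["total"],
        "toolSequenceChanges": changes,
    }
```

A negative `stepDelta` or `tokenDelta` means the variant needed fewer steps or tokens than the baseline.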

Use from a shell — CLI

Agents with a shell tool (and humans) can call the CLI; every command prints parseable JSON.

agent-lab run --system "You are careful." --model cpa/glm-5.2 --task "Refactor the parser."
agent-lab run --config matrix.json --concurrency 3   # variation matrix, run in parallel
agent-lab run --from <runId>                          # replay a stored experiment
agent-lab list
agent-lab show <runId>
agent-lab compare <runId-a> <runId-b>
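
Because every command prints JSON, the CLI is easy to drive from a script. The sketch below wraps it with `subprocess`. The flags are the ones shown above. The helper names, the `timeout_s` parameter, and the assumptions that stdout holds exactly one JSON document and that failures exit non-zero are not confirmed by this README.

```python
import json
import subprocess


def run_argv(system: str, model: str, task: str) -> list[str]:
    """Argument vector for a single `agent-lab run` (flags as shown above)."""
    return ["agent-lab", "run", "--system", system, "--model", model, "--task", task]


def compare_argv(run_id_a: str, run_id_b: str) -> list[str]:
    """Argument vector for `agent-lab compare`; the first id is the baseline."""
    return ["agent-lab", "compare", run_id_a, run_id_b]


def call_cli(argv: list[str], timeout_s: float = 600.0):
    """Run the CLI and parse its stdout as JSON.

    Assumes one JSON document on stdout and a non-zero exit code on failure
    (check=True then raises subprocess.CalledProcessError).
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s, check=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when prompts contain spaces or quotes.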

The config file (--config) holds either a single definition or a variation matrix, for example:

{
  "base": {
    "systemPrompt": "You are a concise agent.",
    "model": "cpa/glm-5.2",
    "taskPrompt": "placeholder",
    "sandbox": { "image": "agent-lab-opencode:latest", "networkAllowlist": ["cpa.funxyz.fun"], "timeoutMs": 120000 }
  },
  "variations": { "taskPrompt": ["Task A", "Task B"] }
}
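
One plausible reading of the matrix is a Cartesian product: each key under `variations` overrides the same top-level field of `base`, and every combination becomes one run. The README does not spell out the expansion rule, so treat the sketch below as an illustration of that assumption. The function name and the handling of a config without a `base` key are hypothetical.

```python
import copy
import itertools


def expand_matrix(config: dict) -> list[dict]:
    """Expand a variation matrix into concrete experiment definitions.

    Assumption: variations form a Cartesian product over top-level fields of
    `base`. A config without a `base` key is treated as a single definition.
    """
    if "base" not in config:
        return [copy.deepcopy(config)]
    base = config["base"]
    variations = config.get("variations", {})
    keys = list(variations)
    definitions = []
    # With no variation keys, product() yields one empty combination: the base itself.
    for combo in itertools.product(*(variations[k] for k in keys)):
        definition = copy.deepcopy(base)  # keep nested sandbox options independent per run
        definition.update(zip(keys, combo))
        definitions.append(definition)
    return definitions
```

Under this reading, the example config above yields two definitions that differ only in `taskPrompt`.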

Sandbox backends

Set backend on the sandbox options:

  • docker (default) — one container per run; strong FS/PID/network isolation; the vacuum network mode is enforced with an in-container iptables allowlist. Requires Docker.
  • microsandbox — a libkrun microVM per run, no Docker daemon. Same behavior behind the same contract (port publish, NetworkPolicy egress allowlist, guaranteed teardown). Requires the microsandbox runtime (curl -fsSL https://install.microsandbox.dev | sh) and a registry image (microsandbox pulls images from a registry, not a local Docker build), on macOS Apple Silicon or Linux+KVM. The SDK is lazy-loaded, so the Docker path never touches it.

Network modes

Set networkMode on the sandbox options:

  • open (default) — bridge networking; the agent can reach its LLM. Fast, egress open.
  • vacuum — strict deny-by-default egress via an in-container iptables allowlist (only DNS + the resolved allowlist hosts, e.g. the LLM endpoint + opencode infra). IPv6 fails closed.
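
To show what a deny-by-default egress allowlist amounts to, the sketch below generates a representative iptables rule set from already-resolved addresses. These are not the rules Agent Lab installs. The function name, rule order, and the conntrack rule are assumptions, and the addresses in the usage note are documentation-range placeholders.

```python
import ipaddress


def vacuum_rules(allowed_ips: list[str], resolver_ip: str) -> list[str]:
    """Generate illustrative deny-by-default egress rules.

    `allowed_ips` are the already-resolved allowlist hosts; `resolver_ip` is
    the DNS resolver the sandbox uses. Invalid addresses raise ValueError.
    """
    resolver = ipaddress.ip_address(resolver_ip)
    if resolver.version != 4:
        raise ValueError("this sketch only handles an IPv4 resolver")
    rules = [
        "iptables -P OUTPUT DROP",             # default: drop all egress
        "iptables -A OUTPUT -o lo -j ACCEPT",  # allow loopback traffic
        "iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
        f"iptables -A OUTPUT -d {resolver} -p udp --dport 53 -j ACCEPT",  # DNS over UDP
        f"iptables -A OUTPUT -d {resolver} -p tcp --dport 53 -j ACCEPT",  # DNS over TCP
    ]
    for raw in allowed_ips:
        ip = ipaddress.ip_address(raw)
        if ip.version == 4:  # IPv6 entries get no ACCEPT rule, so they stay blocked
            rules.append(f"iptables -A OUTPUT -d {ip} -j ACCEPT")
    rules.append("ip6tables -P OUTPUT DROP")   # IPv6 fails closed
    return rules
```

For example, `vacuum_rules(["203.0.113.10"], "192.0.2.53")` returns seven rule strings: the default drop, loopback, conntrack, two DNS rules, one ACCEPT for the allowlisted address, and the IPv6 drop.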

What gets captured (RunTrace)

runId, experiment metadata, status (success/error/timeout), timings, ordered steps (assistant messages + tool calls with ok/error), tokenUsage, finalOutput (text + git diff), and error/partial when relevant.
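
A consumer of get_run will often want a compact summary rather than the full trace. The sketch below derives one from the fields named above. The exact key names inside each step (`type`, `tool`, `ok`) are assumptions for illustration and may not match the stored record.

```python
from collections import Counter

# Status values as listed in the README.
VALID_STATUSES = {"success", "error", "timeout"}


def summarize(trace: dict) -> dict:
    """Condense a trace into per-tool call counts and failure totals.

    Assumed step shape: {"type": "assistant"} or
    {"type": "tool_call", "tool": <name>, "ok": <bool>}.
    """
    if trace["status"] not in VALID_STATUSES:
        raise ValueError(f"unexpected status: {trace['status']!r}")
    tool_steps = [s for s in trace["steps"] if s.get("type") == "tool_call"]
    return {
        "runId": trace["runId"],
        "status": trace["status"],
        "assistantMessages": len(trace["steps"]) - len(tool_steps),
        "toolCalls": dict(Counter(s["tool"] for s in tool_steps)),
        # A tool call with no recorded "ok" flag is counted as failed.
        "failedToolCalls": sum(1 for s in tool_steps if not s.get("ok", False)),
    }
```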

More

  • docs/LIVE_RUN.md — end-to-end live run walkthrough.
  • docs/ — GRACE artifacts (requirements, technology, development plan, verification plan, knowledge graph). AGENTS.md — engineering protocol.

Known limitations

  • Teardown is guaranteed on normal, error, timeout, and container-crash paths, but not if the host agent-lab process is hard-killed (SIGKILL). Containers are labeled agent-lab.sandbox=1 for cleanup: docker ps -aq --filter label=agent-lab.sandbox=1 | xargs docker rm -f.
  • In vacuum mode, IPv6 is reachable only under a non-default Docker IPv6 setup; DNS exfiltration to the configured resolver remains theoretically possible.
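
The teardown limitation follows from how in-process cleanup works in general. The sketch below is a generic illustration, not Agent Lab's code, and `start` and `stop` are hypothetical callables. Cleanup placed in a `finally` block runs on normal exit and on any exception, which covers errors and timeouts raised as exceptions. It cannot run if the process receives SIGKILL, because the process is terminated without executing any further code.

```python
from contextlib import contextmanager


@contextmanager
def sandbox(start, stop):
    """Start a sandbox and always stop it when the block exits.

    `start` returns a handle (e.g. a container id) and `stop` tears it down.
    """
    handle = start()
    try:
        yield handle
    finally:
        # Runs on success and on exceptions; never runs after SIGKILL.
        stop(handle)
```

This is why the label-based cleanup command above is still needed as a fallback after a hard kill.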
