llm-gateway-mcp
A self-hostable MCP server that routes prompts to multiple LLM providers using declarative policies, with multi-role orchestration for independence and verification.
README
llm-gateway-mcp
A small, self-hostable MCP server that routes prompts to multiple LLM providers by a declarative policy, with multi-role orchestration inspired by two Sakana AI papers. Clone it, drop in your own API keys, and point any MCP client at it.
The caller never picks a model. It names a
task_class; the gateway imposes the right model, applies cost/reliability policies, and (optionally) runs a plan → execute → verify pipeline across several models.
- 4 pluggable providers — OpenAI, Anthropic, Google Gemini (API key, not Vertex), DeepSeek. Each reads its own key from the environment.
- Declarative routing (
routing.yaml) — model × provider × max_tokens per task_class, with cost preflight, circuit breaker, retry/backoff, and per-task fallback chains. - Sakana-inspired orchestration that actually runs — independence certification, Thinker/Worker/Verifier roles, and a compose pipeline with controlled per-step visibility and failure-gated re-planning.
- No keys? Still testable.
smoke_test.pyand thetests/suite mock every provider call, so the whole thing runs green offline.
The orchestration features (the interesting part)
These implement, in a domain-agnostic way, ideas from:
- "TRINITY: An Evolved LLM Coordinator" — Sakana AI, arXiv:2512.04695, ICLR 2026
- "Learning to Orchestrate Agents in Natural Language with the Conductor" — Sakana AI, arXiv:2512.04388, ICLR 2026
A key adaptation: the papers optimize synergy (workers reading each other to converge). For cross-checking we often want the opposite — independence — so that agreement between models is evidence, not an echo. This gateway keeps the two planes separate on purpose.
| # | Feature | Where | What it does |
|---|---|---|---|
| 1 | Declarative routing by task_class |
routing.yaml, server.py |
model × provider × max_tokens, cost preflight before dispatch. |
| 2 | Independence certification (Conductor T-02 access_list / visibility) | orchestration/independence.py |
For blind parallel panels, proves each member saw only the original prompt and stamps independence_certified into meta. enforce: hard fails closed. |
| 3 | Thinker / Worker / Verifier roles (Trinity T-03) | orchestration/roles.py, routing.yaml |
Per-role instruction templates + a configurable role → model table. |
| 4 | compose: plan → execute → verify | orchestration/compose.py |
One model plans, another executes, a third verifies. The verifier is blind to the plan and judges the artifact against the original task. One failure-gated re-plan (cap 1). |
| 5 | Depth by difficulty | routing.yaml/orchestration.depth |
trivial = 1 step, standard = +verify, complex = full pipeline. A declared parameter — not an LLM difficulty classifier. |
Plus the generic reliability policies: cost preflight + caps, circuit breaker, retry with backoff, and per-task fallback.
Architecture
llm-gateway-mcp/
├── server.py # FastMCP entry: llm_route, llm_orchestrate, llm_routing_info
├── routing.yaml # declarative policy (task_classes, panels, visibility, roles, depth)
├── providers/ # one adapter per provider, shared 3-state response envelope
│ ├── __init__.py # registry + dispatch()
│ ├── openai.py # OPENAI_API_KEY (httpx, Chat Completions)
│ ├── anthropic.py # ANTHROPIC_API_KEY (official anthropic SDK, Messages API)
│ ├── gemini.py # GEMINI_API_KEY (google-genai, API key — NOT Vertex)
│ └── deepseek.py # DEEPSEEK_API_KEY (httpx, OpenAI-compatible)
├── policies/ # generic, provider-agnostic
│ ├── cost_estimator.py # preflight max-cost projection
│ ├── cost_ledger.py # optional SQLite spend ledger + caps + kill switches
│ ├── circuit_breaker.py # per (provider, model) breaker
│ ├── retry_backoff.py # transient-only retry
│ ├── error_taxonomy.py # normalize provider errors
│ └── fallback.py # per-task_class fallback chains
├── orchestration/ # the Sakana-inspired layer
│ ├── independence.py # access_list / visibility certification
│ ├── roles.py # Thinker/Worker/Verifier + depth
│ └── compose.py # plan → execute → verify pipeline
├── smoke_test.py # offline structural test (mocks every model call)
└── tests/ # pytest suite (offline)
Response envelope (every provider, every tool):
{ "status": "success",
"data": { "text": "..." },
"meta": { "provider": "...", "model": "...", "latency_ms": 0,
"tokens": {"input": 0, "output": 0, "total": 0},
"cost_usd_approx": 0.0, "task_class": "..." } }
Install & run
git clone <your-fork-url> llm-gateway-mcp
cd llm-gateway-mcp
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt # or: pip install -e ".[dev]"
cp .env.example .env # then put YOUR keys in .env
Verify it works with no keys needed (everything is mocked):
python smoke_test.py # 9/9 PASS
pytest -q # 22 passed
Run as an MCP server (stdio):
python server.py
Register it with an MCP client (example mcp.json entry):
{
"mcpServers": {
"llm-gateway": {
"command": "python",
"args": ["/absolute/path/to/llm-gateway-mcp/server.py"]
}
}
}
You only need keys for the providers your routing.yaml actually targets.
One key is enough to start — any single-model
task_classworks with just that provider's key. But the real value here is the cross-model orchestration (panels, blind triangulation, plan→execute→verify), which needs 2+ different providers to shine — agreement across independent models is only evidence if the models are actually different. So: 1 key works, 2 or more is recommended.
Tools
llm_route(task_class, prompt, system?, max_tokens?, override_model?)
Routes by task_class. Single-model classes return one answer; panel classes
(those with members: in routing.yaml, e.g. dual_opinion, triple_review)
run every member in parallel on the same original prompt and certify
independence in meta.
llm_route(task_class="general_reasoning", prompt="Plan a migration from X to Y.")
llm_route(task_class="triple_review", prompt="Is this argument sound? ...")
# -> data.members = [3 independent answers], meta.visibility.independence_certified = true
llm_orchestrate(task, depth?)
Runs the plan → execute → verify pipeline. depth ∈ trivial | standard | complex.
llm_orchestrate(task="Draft a concise refund policy for a SaaS product.", depth="complex")
# -> data: { artifact, plan, verdict }, meta: { steps, rounds, visibility }
llm_routing_info()
Returns the active policy: version, task_classes, providers, panels, visibility contracts, orchestration depths, and current circuit-breaker state.
Configuration
Everything routable lives in routing.yaml — edit it freely:
defaults.<task_class>→{provider, model, max_tokens}(ormembers:for a panel)cost_preflight→warn_usd/block_usdthresholdsvisibility.<task_class>→mode: blind,enforce: hard|softfallback.<task_class>→ ordered alternates (empty = no fallback)orchestration→roles, per-roleinstructions, anddepthtable
Optional spend controls (env, disabled by default — see .env.example):
LLM_GATEWAY_LEDGER, LLM_GATEWAY_CAP_TOTAL_MONTHLY,
LLM_GATEWAY_CAP_<PROVIDER>_MONTHLY, plus kill switches
LLM_GATEWAY_DISABLED and LLM_GATEWAY_EXPENSIVE_DISABLED.
⚠️ Pricing in each
providers/*.pyMODELStable is illustrative. Verify against each provider's live pricing before trusting cost preflight in production.
Providers at a glance
| Provider | Env var | Transport | Example models |
|---|---|---|---|
| OpenAI | OPENAI_API_KEY |
httpx (Chat Completions) | gpt-4o, gpt-4o-mini, o3-mini |
| Anthropic | ANTHROPIC_API_KEY |
anthropic SDK (Messages) |
claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5 |
| Google Gemini | GEMINI_API_KEY |
google-genai (API key) |
gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite |
| DeepSeek | DEEPSEEK_API_KEY |
httpx (OpenAI-compatible) | deepseek-chat, deepseek-reasoner |
Adding a provider: drop a module in providers/ exposing MODELS and an
async def complete(messages, model, max_tokens, **kwargs) returning the shared
envelope, then register it in providers/__init__.py.
License
MIT © Felipe Márquez. See LICENSE.
Paper credits: TRINITY (arXiv:2512.04695) and Conductor (arXiv:2512.04388), Sakana AI, ICLR 2026. This project implements ideas from those papers in a generic form; it is not affiliated with or endorsed by Sakana AI.
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.