MCP Servers

local-llm-mcp

Enables coding agents like Claude Code and Codex to offload boilerplate generation, summarization, and other bounded text tasks to local or cheap cloud LLMs, keeping the frontier agent in charge of judgment and code edits.

README

local-llm-mcp

local-llm-mcp hero

English · <a href="README.zh-TW.md">繁體中文</a>

Your coding agent is brilliant at deciding what to build — and overqualified for typing the boilerplate. local-llm-mcp lets Claude Code, Codex, and any MCP client hand the boring, token-heavy work to a model running on your own machine (or a cheap cloud one), while the frontier agent keeps doing the thinking.

terminal demo

TL;DR

Claude Code / Codex   →   decides, reads the repo, edits files, runs tests, reviews
local-llm-mcp         →   routes one bounded task out through an MCP tool
LM Studio / Ollama    →   drafts the boilerplate, tests, docs, summaries — for free

The smart agent stays in charge. The cheap model just hands back text. You stop paying frontier prices to scaffold a pytest file.

git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh        # one command: venv, register with Claude Code + Codex, smoke test

The problem

Frontier coding agents are metered, and they're worth it — for judgment. Reading a repo, planning a change, reviewing a diff, deciding what's safe to merge.

But a huge slice of what they actually emit is bounded, low-risk generation:

the first draft of a function you're going to rewrite anyway
a pytest skeleton
boilerplate, config, glue code
a 600-line file summarized down to 10 bullets
"give me three alternative implementations"

You're paying premium per-token rates for work a 7B model on your laptop does fine. A local model costs $0 per token. The math only goes one way.

How it works

local-llm-mcp is a tiny MCP server. It exposes your local and cheap-cloud LLMs as tools the main agent can call. The delegated model only returns text — it can't read your repo, edit files, or run commands. That boundary is the whole point.

delegation boundary

The main agent decides when to delegate, reviews what comes back, and remains 100% responsible for anything that touches your code.

Quick start

git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh

Then open a new Claude Code or Codex session. That's it.

setup.sh will:

Create a .venv and install the project.
Create keys.json and custom_backends.json from examples.
Pick a RAM safety threshold for your machine.
Register the MCP server with both Claude Code and Codex (if their CLIs are installed).
Run a smoke test.

No API keys required to start — local backends work out of the box.

✅ Delegate this / 🚫 keep this

✅ Good for the worker model	🚫 Keep with the main agent
README / docstring first drafts	Final architecture decisions
Boilerplate, config, glue code	Security & correctness sign-off
`pytest` / `unittest` scaffolds	Anything that edits your repo
Long-file summaries	Running shell commands
Repetitive format conversions	Applying a patch unreviewed
"Sketch 3 alternative approaches"	Judgment calls of any kind

Rule of thumb: if a wrong answer is cheap to catch, delegate it. If it's expensive to catch, don't.

Supported backends

Local backends need nothing but a running server. Cloud backends are optional fallbacks and read their key from an env var or keys.json.

Backend	Type	Protocol	Default URL	Default model	Key
`lmstudio`	local	OpenAI	`http://localhost:1234/v1`	`qwen/qwen3-coder-next`	—
`ollama`	local	OpenAI	`http://localhost:11434/v1`	`qwen2.5-coder:7b`	—
`vllm`	local	OpenAI	`http://localhost:8001/v1`	auto	—
`llamacpp`	local	OpenAI	`http://localhost:8080/v1`	auto	—
`ds4`	local	OpenAI	`http://127.0.0.1:8000/v1`	auto	—
`deepseek`	cloud	OpenAI	`https://api.deepseek.com/v1`	`deepseek-v4-flash`	`DEEPSEEK_API_KEY`
`openrouter`	cloud	OpenAI	`https://openrouter.ai/api/v1`	`anthropic/claude-sonnet-4`	`OPENROUTER_API_KEY`
`groq`	cloud	OpenAI	`https://api.groq.com/openai/v1`	`openai/gpt-oss-120b`	`GROQ_API_KEY`
`cerebras`	cloud	OpenAI	`https://api.cerebras.ai/v1`	`gpt-oss-120b`	`CEREBRAS_API_KEY`
`agnes`	cloud	OpenAI	`https://apihub.agnes-ai.com/v1`	`agnes-2.0-flash`	`AGNES_API_KEY`
`minimax`	cloud	Anthropic	`https://api.minimaxi.com/anthropic`	`MiniMax-M3`	`MINIMAX_API_KEY`

Every base URL and default model can be overridden by env vars, e.g. OLLAMA_BASE_URL, DEEPSEEK_DEFAULT_MODEL. Need something else? Add a custom backend — no Python required.

MCP tools

Tool	Purpose
`ask_local_model`	Send a prompt to a backend, get back text + usage metadata.
`list_backends`	Show configured backends, URLs, protocols, key status.
`local_status`	Memory, guard state, backend reachability, config paths.
`list_local_models` / `list_models`	List model IDs from backends that expose `GET /models`.
`set_backend`	Add, update, or remove a custom backend live.
`refresh_backends`	Reload `custom_backends.json` without restarting.
`set_guard`	Change the RAM / exclusivity guards live.
`set_system_prefix`	Pin a system prefix for prompt-cache-friendly cloud calls.

Talking to it

Just tell the agent what to delegate:

Use ask_local_model with backend="ollama" to draft a pytest suite for this module.
Don't apply it — review it first, then edit the repo yourself.

Call local_status. If DS4 is running, use backend="ds4" for boilerplate.
Otherwise fall back to backend="ollama".

More patterns in examples/claude-code-prompts.md.

Safety model

Local models are happy to OOM your machine. Two guards stop that:

RAM valve — local calls are refused when free memory drops below LOCAL_LLM_MIN_FREE_GB.
Exclusive backend — when a heavy local backend (ds4 by default) is up, other local backends are blocked so they don't fight for RAM.

Tune them live, no restart:

set_guard(min_free_gb=8)
set_guard(exclusive_backend="none")
set_guard(enforce=0)

Secrets never enter git. keys.json, config.json, and custom_backends.json are gitignored; keys load from env vars or a chmod 600 file. Cloud backends skip the RAM guard — but they send your prompts to a third party and may cost money, so read SECURITY.md before pointing one at proprietary code.

Custom backends

Any OpenAI- or Anthropic-compatible endpoint works. Add it from the tool:

set_backend(name="my_qwen", base_url="http://localhost:9000/v1", default_model="qwen3-coder", local=1, protocol="openai")

...or drop it in custom_backends.json and call refresh_backends. See examples/custom_backends.openrouter.json.

Does it actually save money?

It depends entirely on your workload — so this repo refuses to print a fake percentage. Instead it ships a harness so you can measure your numbers:

python scripts/benchmark.py --backend ollama --model qwen2.5-coder:7b --out results.jsonl

Then compare premium-only vs. delegated mode using BENCHMARK.md and the runbook. Report what you find; don't take a number on faith — not even ours.

Manual install

python3 -m venv .venv
.venv/bin/python -m pip install -e .

# Claude Code
claude mcp add local-llm -s user -e LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"

# Codex
codex mcp add local-llm --env LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"

Tests

python -m unittest discover -s tests -v
python scripts/smoke_test.py

CI runs both on Python 3.10, 3.11, and 3.12.

FAQ

Does the local model touch my files? No. It returns text only. Every edit goes through the main agent.

Do I need a GPU / a big Mac? No. Small coder models (7B) run on modest hardware, and the RAM valve keeps you from OOMing. No local model handy? Point it at a cheap cloud backend.

Is this a fusion model or an autonomous agent? Neither. It's a delegation layer — a tool your existing agent calls.

Why not just switch Claude Code to a cheaper model entirely? Because you want the frontier model's judgment and the cheap model's typing. This keeps both.

Windows / Linux? The server is cross-platform; the RAM guard reads vm_stat on macOS and /proc/meminfo on Linux. Shell helpers are macOS/zsh-flavored.

Related projects

qwable — a local multi-model gateway and agent runtime for Codex & Claude Code on Apple Silicon.
Conclava — a council of local LLMs with task-aware routing and multi-model deliberation.

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. Launch notes live in docs/.

License

MIT — see LICENSE.

Recommended Servers

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official

Featured

TypeScript

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official

Featured

Local

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official

Featured

TypeScript

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official

Featured

Python

E2B

Using MCP to run code via e2b.

Official

Featured

Neon Database

MCP server for interacting with Neon Management API and databases

Official

Featured

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official

Featured

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official

Featured