local-llm-mcp
Enables coding agents like Claude Code and Codex to offload boilerplate generation, summarization, and other bounded text tasks to local or cheap cloud LLMs, keeping the frontier agent in charge of judgment and code edits.
README
local-llm-mcp

<p align="center"> <a href="LICENSE"><img alt="License: MIT" src="https://img.shields.io/badge/License-MIT-black.svg"></a> <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-blue.svg"> <a href="https://modelcontextprotocol.io"><img alt="MCP" src="https://img.shields.io/badge/MCP-compatible-6E56CF.svg"></a> <img alt="Works with Claude Code & Codex" src="https://img.shields.io/badge/works%20with-Claude%20Code%20%7C%20Codex-0a7.svg"> <a href="https://github.com/HenryLinyy/local-llm-mcp/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/HenryLinyy/local-llm-mcp/actions/workflows/ci.yml/badge.svg"></a> <a href="CONTRIBUTING.md"><img alt="PRs welcome" src="https://img.shields.io/badge/PRs-welcome-brightgreen.svg"></a> </p>
<p align="center"><b>English</b> · <a href="README.zh-TW.md">繁體中文</a></p>
Your coding agent is brilliant at deciding what to build — and overqualified for typing the boilerplate.
local-llm-mcplets Claude Code, Codex, and any MCP client hand the boring, token-heavy work to a model running on your own machine (or a cheap cloud one), while the frontier agent keeps doing the thinking.

TL;DR
Claude Code / Codex → decides, reads the repo, edits files, runs tests, reviews
local-llm-mcp → routes one bounded task out through an MCP tool
LM Studio / Ollama → drafts the boilerplate, tests, docs, summaries — for free
The smart agent stays in charge. The cheap model just hands back text. You stop paying frontier prices to scaffold a pytest file.
git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh # one command: venv, register with Claude Code + Codex, smoke test
The problem
Frontier coding agents are metered, and they're worth it — for judgment. Reading a repo, planning a change, reviewing a diff, deciding what's safe to merge.
But a huge slice of what they actually emit is bounded, low-risk generation:
- the first draft of a function you're going to rewrite anyway
- a
pytestskeleton - boilerplate, config, glue code
- a 600-line file summarized down to 10 bullets
- "give me three alternative implementations"
You're paying premium per-token rates for work a 7B model on your laptop does fine. A local model costs $0 per token. The math only goes one way.
How it works
local-llm-mcp is a tiny MCP server. It exposes your local and cheap-cloud LLMs as tools the main agent can call. The delegated model only returns text — it can't read your repo, edit files, or run commands. That boundary is the whole point.

The main agent decides when to delegate, reviews what comes back, and remains 100% responsible for anything that touches your code.
Quick start
git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh
Then open a new Claude Code or Codex session. That's it.
setup.sh will:
- Create a
.venvand install the project. - Create
keys.jsonandcustom_backends.jsonfrom examples. - Pick a RAM safety threshold for your machine.
- Register the MCP server with both Claude Code and Codex (if their CLIs are installed).
- Run a smoke test.
No API keys required to start — local backends work out of the box.
✅ Delegate this / 🚫 keep this
| ✅ Good for the worker model | 🚫 Keep with the main agent |
|---|---|
| README / docstring first drafts | Final architecture decisions |
| Boilerplate, config, glue code | Security & correctness sign-off |
pytest / unittest scaffolds |
Anything that edits your repo |
| Long-file summaries | Running shell commands |
| Repetitive format conversions | Applying a patch unreviewed |
| "Sketch 3 alternative approaches" | Judgment calls of any kind |
Rule of thumb: if a wrong answer is cheap to catch, delegate it. If it's expensive to catch, don't.
Supported backends
Local backends need nothing but a running server. Cloud backends are optional fallbacks and read their key from an env var or keys.json.
| Backend | Type | Protocol | Default URL | Default model | Key |
|---|---|---|---|---|---|
lmstudio |
local | OpenAI | http://localhost:1234/v1 |
qwen/qwen3-coder-next |
— |
ollama |
local | OpenAI | http://localhost:11434/v1 |
qwen2.5-coder:7b |
— |
vllm |
local | OpenAI | http://localhost:8001/v1 |
auto | — |
llamacpp |
local | OpenAI | http://localhost:8080/v1 |
auto | — |
ds4 |
local | OpenAI | http://127.0.0.1:8000/v1 |
auto | — |
deepseek |
cloud | OpenAI | https://api.deepseek.com/v1 |
deepseek-v4-flash |
DEEPSEEK_API_KEY |
openrouter |
cloud | OpenAI | https://openrouter.ai/api/v1 |
anthropic/claude-sonnet-4 |
OPENROUTER_API_KEY |
groq |
cloud | OpenAI | https://api.groq.com/openai/v1 |
openai/gpt-oss-120b |
GROQ_API_KEY |
cerebras |
cloud | OpenAI | https://api.cerebras.ai/v1 |
gpt-oss-120b |
CEREBRAS_API_KEY |
agnes |
cloud | OpenAI | https://apihub.agnes-ai.com/v1 |
agnes-2.0-flash |
AGNES_API_KEY |
minimax |
cloud | Anthropic | https://api.minimaxi.com/anthropic |
MiniMax-M3 |
MINIMAX_API_KEY |
Every base URL and default model can be overridden by env vars, e.g. OLLAMA_BASE_URL, DEEPSEEK_DEFAULT_MODEL. Need something else? Add a custom backend — no Python required.
MCP tools
| Tool | Purpose |
|---|---|
ask_local_model |
Send a prompt to a backend, get back text + usage metadata. |
list_backends |
Show configured backends, URLs, protocols, key status. |
local_status |
Memory, guard state, backend reachability, config paths. |
list_local_models / list_models |
List model IDs from backends that expose GET /models. |
set_backend |
Add, update, or remove a custom backend live. |
refresh_backends |
Reload custom_backends.json without restarting. |
set_guard |
Change the RAM / exclusivity guards live. |
set_system_prefix |
Pin a system prefix for prompt-cache-friendly cloud calls. |
Talking to it
Just tell the agent what to delegate:
Use ask_local_model with backend="ollama" to draft a pytest suite for this module.
Don't apply it — review it first, then edit the repo yourself.
Call local_status. If DS4 is running, use backend="ds4" for boilerplate.
Otherwise fall back to backend="ollama".
More patterns in examples/claude-code-prompts.md.
Safety model
Local models are happy to OOM your machine. Two guards stop that:
- RAM valve — local calls are refused when free memory drops below
LOCAL_LLM_MIN_FREE_GB. - Exclusive backend — when a heavy local backend (
ds4by default) is up, other local backends are blocked so they don't fight for RAM.
Tune them live, no restart:
set_guard(min_free_gb=8)
set_guard(exclusive_backend="none")
set_guard(enforce=0)
Secrets never enter git. keys.json, config.json, and custom_backends.json are gitignored; keys load from env vars or a chmod 600 file. Cloud backends skip the RAM guard — but they send your prompts to a third party and may cost money, so read SECURITY.md before pointing one at proprietary code.
Custom backends
Any OpenAI- or Anthropic-compatible endpoint works. Add it from the tool:
set_backend(name="my_qwen", base_url="http://localhost:9000/v1", default_model="qwen3-coder", local=1, protocol="openai")
...or drop it in custom_backends.json and call refresh_backends. See examples/custom_backends.openrouter.json.
Does it actually save money?
It depends entirely on your workload — so this repo refuses to print a fake percentage. Instead it ships a harness so you can measure your numbers:
python scripts/benchmark.py --backend ollama --model qwen2.5-coder:7b --out results.jsonl
Then compare premium-only vs. delegated mode using BENCHMARK.md and the runbook. Report what you find; don't take a number on faith — not even ours.
Manual install
python3 -m venv .venv
.venv/bin/python -m pip install -e .
# Claude Code
claude mcp add local-llm -s user -e LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"
# Codex
codex mcp add local-llm --env LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"
Tests
python -m unittest discover -s tests -v
python scripts/smoke_test.py
CI runs both on Python 3.10, 3.11, and 3.12.
FAQ
Does the local model touch my files? No. It returns text only. Every edit goes through the main agent.
Do I need a GPU / a big Mac? No. Small coder models (7B) run on modest hardware, and the RAM valve keeps you from OOMing. No local model handy? Point it at a cheap cloud backend.
Is this a fusion model or an autonomous agent? Neither. It's a delegation layer — a tool your existing agent calls.
Why not just switch Claude Code to a cheaper model entirely? Because you want the frontier model's judgment and the cheap model's typing. This keeps both.
Windows / Linux? The server is cross-platform; the RAM guard reads vm_stat on macOS and /proc/meminfo on Linux. Shell helpers are macOS/zsh-flavored.
Related projects
- qwable — a local multi-model gateway and agent runtime for Codex & Claude Code on Apple Silicon.
- Conclava — a council of local LLMs with task-aware routing and multi-model deliberation.
Contributing
Issues and PRs welcome — see CONTRIBUTING.md. Launch notes live in docs/.
License
MIT — see LICENSE.
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.