llm-localfirst
A local-first LLM routing MCP server that keeps sensitive data on your own models, with fail-closed privacy and manager-worker delegation, exposing route and complete tools to any MCP client.
README
llm-localfirst
Local-first LLM routing β keep sensitive data and bulk text labor on your own models, and call the cloud only for the hard part.
<p align="center"> <img src="https://raw.githubusercontent.com/shaxzodbek-uzb/llm-localfirst/main/docs/demo.svg" alt="llm-localfirst CLI: bulk work falls back to cloud; a sensitive call fails closed and never reaches the cloud" width="820"> </p>
Most LLM routers optimize which cloud provider to call for cost or failover.
llm-localfirst inverts the default: it runs on your own local model first
(Ollama / vLLM / LM Studio) and reaches for the cloud only when the work genuinely
needs it. It adds two things mainstream routers don't:
- π Privacy routing that fails closed. A call you flag
sensitive=Trueis pinned to a local model and is never allowed to fall back to the cloud. If the local model is down, the call raises β it does not quietly ship your prompt to a third-party API. - π€ Managerβworker delegation. Give a cloud "director" agent a drop-in tool that offloads token-heavy, low-risk text labor (summarize / draft / translate / reformat / extract / classify) to a fast local worker β cutting cloud spend and keeping bulk data on your hardware. (Extracted from a production Pydantic AI agent.)
Plus a model allowlist guard (arbitrary model strings are rejected β an SSRF/cost blast-radius control), cached reachability probing, an MCP server wrapper, and a CLI.
The privacy guarantee, in five lines
from llm_localfirst import Router, LocalUnavailable
router = Router.from_env()
try:
out = router.complete("Redact all PII from this record.",
source=customer_record, sensitive=True)
except LocalUnavailable:
# Local model is down. We did NOT send the record to the cloud. You decide.
...
sensitive=True means this data must not leave the box. The router would rather fail
than leak. That asymmetry β sensitive calls fail closed, ordinary bulk calls fall back to
cloud β is the product.
Install
pip install llm-localfirst # the routing brain β zero provider SDKs
pip install "llm-localfirst[openai]" # + talk to local Ollama/vLLM/LM Studio (and cloud OpenAI)
pip install "llm-localfirst[anthropic]" # + Claude (the default cloud fallback / reason model)
pip install "llm-localfirst[all]" # everything (also: mcp, pydantic-ai)
| Extra | Adds | Needed for |
|---|---|---|
| (none) | pydantic-settings |
router.decide(...) β pure routing, no calls |
openai |
openai |
running calls on a local OpenAI-compatible server (or cloud OpenAI) |
anthropic |
anthropic |
the default cloud fallback / reason model (Claude) |
mcp |
mcp |
llm-localfirst mcp (expose the router over MCP) |
pydantic-ai |
pydantic-ai-slim |
the manager-worker attach_worker integration |
The decision path (decide()) imports no provider SDK, so you can inspect routing β
and run the whole test suite β with nothing but the core installed.
60-second quickstart (Ollama)
ollama pull qwen2.5:7b # any OpenAI-compatible local server works
pip install "llm-localfirst[openai,anthropic]"
export ANTHROPIC_API_KEY=sk-ant-... # only needed for the cloud fallback / reason path
from llm_localfirst import Router, Kind
router = Router.from_env()
# 1) Inspect routing WITHOUT spending a token.
print(router.decide(kind=Kind.BULK)) # -> local (cheap + private)
print(router.decide(kind="reason")) # -> cloud (the hard part)
print(router.decide(sensitive=True)) # -> local (pinned; never cloud)
# 2) Actually run it. Bulk work prefers local, and falls back to cloud only if local is down.
print(router.complete("Summarize this in one sentence.",
source=long_text, kind=Kind.BULK).text)
Or from the shell:
llm-localfirst doctor # show config, the allowlist, and local up/down
llm-localfirst route "summarize this" --kind bulk
llm-localfirst route "redact this" --sensitive # exits non-zero if local is down (fail-closed)
How routing decides
decide() probes whether your local model is reachable (cached), then applies these
rules in order:
| Call | Local up | Local down |
|---|---|---|
sensitive=True |
local | raises LocalUnavailable (fail-closed) |
explicit model="<cloud>" + sensitive=True |
β | raises PrivacyViolation |
kind="reason" |
cloud | cloud |
kind="bulk" / "auto" (default) |
local | cloud fallback (fell_back=True) |
explicit model="<name>" |
that allowlisted model (cloud blocked only when sensitive) |
Any explicit model must be a name on the allowlist; an arbitrary string (or a stray
URL) raises ModelNotAllowed. That allowlist is the SSRF / cost guard β a caller can
never point the router at a new endpoint or an expensive model it wasn't configured with.
Managerβworker delegation (Pydantic AI)
Let a cloud director keep the planning and tool-calls, and offload the grunt text work to a local worker:
from pydantic_ai import Agent
from llm_localfirst import Router
from llm_localfirst.integrations.pydantic_ai import attach_worker
router = Router.from_env()
director = Agent("anthropic:claude-haiku-4-5", system_prompt="...")
# Adds a `delegate_to_worker(task, source)` tool that routes to your LOCAL model.
# attach_worker REFUSES a non-local worker, so delegated source text can't leak.
attach_worker(director, router, worker_model="local",
on_delegate=lambda task, result: ...) # optional observability hook
The director calls delegate_to_worker for summaries, drafts, translations,
reformatting, and extraction; those run on your GPU instead of burning cloud tokens.
See examples/manager_worker.py.
MCP-native
Expose the router to any MCP client (Claude Desktop, IDEs, agents) as two tools β
route (dry decision) and complete:
pip install "llm-localfirst[mcp]"
llm-localfirst mcp # serves over stdio
How it compares
llm-localfirst is not a general multi-provider gateway, and it isn't trying to be.
To be clear and fair: LiteLLM and Bifrost can already route to local models (Ollama,
vLLM) β local capability is not the differentiator. The differentiators are the
fail-closed privacy pin, the manager-worker delegation tool, and a local-first
default posture.
| Capability | llm-localfirst | LiteLLM | OpenRouter | llmrouter-lib |
|---|---|---|---|---|
| Route to local models (Ollama/vLLM) | β | β | β | β |
| Default posture is local-first | β | β (cloud proxy) | β | β |
| Sensitive calls fail closed β never fall back to cloud | β | β | β | β |
| Manager-worker delegation tool (cloudβlocal) | β | β | β | β |
| Allowlist guard (reject arbitrary model strings) | β | β | β | β |
| Many cloud providers / load-balancing / caching | β (by design) | β | β | β |
If you want a broad cloud gateway with dozens of providers, use LiteLLM. If you want your private data to stay local by construction and your bulk work to run on your own hardware, that's this library.
What this is NOT
- Not a multi-cloud gateway. It ships one local backend + Claude (+ optional OpenAI). Add more by registering them on the allowlist; it won't grow a hundred provider shims.
- Not a content classifier. You tag a call
sensitive=True(or pick akind). It does not guess whether your text is private β it enforces what you declare. - Not load-balancing / semantic caching / cost analytics. Those are gateway features; this is a routing policy with a privacy guarantee.
- Not a prompt firewall. It controls where a call runs, not what's in it.
Configuration
All settings are read from the environment (prefix LF_) or a .env file. See
.env.example. Highlights:
| Variable | Default | Meaning |
|---|---|---|
LF_LOCAL_BASE_URL |
http://localhost:11434/v1 |
local OpenAI-compatible endpoint |
LF_LOCAL_MODEL_ID |
qwen2.5:7b |
local model id |
LF_FALLBACK_MODEL |
haiku |
cloud model for non-sensitive fallback |
LF_REASON_MODEL |
haiku |
cloud model for kind="reason" |
LF_SENSITIVE_FAIL_CLOSED |
true |
keep sensitive calls from ever leaking |
LF_PROBE_TTL |
30.0 |
seconds to cache the reachability probe |
Development
uv venv && uv pip install -e '.[dev]'
ruff check . && pytest
The routing brain (policy, registry, router, reachability) is covered 100% offline β no network and no provider SDKs required. Contributions welcome; see CONTRIBUTING.md.
License
MIT Β© Shaxzodbek Qambaraliyev / Blaze. See LICENSE.
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.