local-mem0-mcp
A fully local, zero-config Mem0 memory server for MCP clients on macOS. No LLM, no API key, no cloud — and no switch to flip. It starts when your IDE/CLI opens and shuts itself off (freeing RAM) when you're done.
Unofficial community tool — not affiliated with mem0ai.
Highlights
- 🧠 No LLM in the loop. Your MCP client is already a capable LLM, so it does the "smart memory" reasoning (extract facts, dedup, merge, resolve conflicts) and calls simple primitives. No second model, no API key, no cost.
- 💾 100% local. Embeddings run on-device (all-MiniLM-L6-v2); memories live in a local Chroma store at ~/.mem0-mcp/chroma. Works offline.
- ⚡ Auto-managed lifecycle. Launching a client starts the backend on demand; closing the last client lets it idle-exit and free ~200 MB. No manual toggle.
- 🤝 Multi-client safe. Kiro, Claude Desktop, Cursor, … all share one backend process — a single Chroma writer, no duplicate servers, no zombies.
- 📌 Always-on core memory. Pin the few must-not-forget facts; they're mirrored to a file your rules load every session, so they're always in context — no search required.
How it fits together
┌────────────┐ stdio ┌───────────────┐ HTTP 127.0.0.1:8765 ┌─────────────────────┐
│ MCP client │─spawns──▶│ mem0_proxy │────────────────────────▶│ mem0 backend (one) │
│ (Kiro/IDE) │◀─tools───│ (per client) │ forwards + keepalive │ embed + Chroma │
└────────────┘ └───────────────┘ └─────────────────────┘
│ close ─▶ proxy dies ─▶ backend idle-exits (frees RAM) ▲ single writer
more clients ── each spawns its own lightweight proxy ────────────────────┘ (shared backend)
Your client launches a tiny stdio proxy. The proxy starts the shared HTTP backend on demand and forwards every tool call to it, keeping it warm while you work. When the last client closes, the backend idle-exits on its own.
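The "start on demand" step can be sketched as below. This is a minimal illustration, not the code in server/mem0_proxy.py: the helper names `backend_is_up` and `kickstart_command` are hypothetical, and the readiness check is assumed to be a plain TCP connect. The launchctl invocation mirrors the one shown under Troubleshooting.

```python
import os
import socket

# Defaults taken from the Configuration tables; helper names are hypothetical.
HOST = "127.0.0.1"
PORT = int(os.environ.get("MEM0_MCP_PORT", "8765"))
LABEL = os.environ.get("MEM0_SERVER_LABEL", "com.mem0mcp.server")


def backend_is_up(host: str = HOST, port: int = PORT, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def kickstart_command(label: str = LABEL) -> list:
    """Build the launchctl call that starts the per-user agent on demand."""
    return ["launchctl", "kickstart", f"gui/{os.getuid()}/{label}"]


if __name__ == "__main__":
    if not backend_is_up():
        # On macOS a proxy could pass this list to subprocess.run(...).
        # It is only printed here so the sketch is safe to run anywhere.
        print("would run:", " ".join(kickstart_command()))
```

A real proxy would presumably poll `backend_is_up` until MEM0_BACKEND_READY_TIMEOUT elapses before forwarding the first tool call.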
Requirements
- macOS 12+
- Python 3.10+ (python3)
That's it — no Xcode, no API keys, no external services. The embedding model downloads once on first use (~90 MB), then runs fully offline.
Install
git clone https://github.com/ost527/local-mem0-mcp.git
cd local-mem0-mcp
./install.sh
install.sh creates a virtualenv, installs deps (mem0ai, fastmcp, chromadb,
sentence-transformers), and registers a single on-demand launchd agent for
the backend. It prints the exact MCP config snippet to copy. Tune defaults via
env vars:
MEM0_MCP_PORT=8800 MEM0_IDLE_TIMEOUT=900 ./install.sh
Connect your MCP client
Add this to your client's MCP config (e.g. ~/.kiro/settings/mcp.json, Claude
Desktop, Cursor) — point it at the stdio proxy (use the absolute paths
install.sh prints):
{
  "mcpServers": {
    "local-mem0-mcp": {
      "command": "/ABS/PATH/local-mem0-mcp/.venv/bin/python3",
      "args": ["/ABS/PATH/local-mem0-mcp/server/mem0_proxy.py"]
    }
  }
}
Restart the client. The first memory call takes a few seconds (the backend cold-starts and loads the embedder); after that it's instant.
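If you would rather generate the snippet than copy it, something like the following works. It is a sketch under assumptions: `mcp_config` is a hypothetical helper (install.sh already prints the real paths), the clone location in the demo is made up, and the optional `env` block is the one the Configuration section mentions for proxy variables.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def mcp_config(repo_dir: str, env: Optional[Dict[str, str]] = None) -> dict:
    """Build the mcpServers entry with absolute paths into the cloned repo."""
    repo = Path(repo_dir).expanduser().resolve()
    entry = {
        "command": str(repo / ".venv" / "bin" / "python3"),
        "args": [str(repo / "server" / "mem0_proxy.py")],
    }
    if env:
        # Proxy settings such as MEM0_MCP_PORT go in the env block.
        entry["env"] = dict(env)
    return {"mcpServers": {"local-mem0-mcp": entry}}


if __name__ == "__main__":
    # "~/local-mem0-mcp" is a hypothetical clone location.
    print(json.dumps(mcp_config("~/local-mem0-mcp", {"MEM0_MCP_PORT": "8800"}), indent=2))
```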
Tools
| Tool | What it does |
|---|---|
| add_memory(text, user_id?) | Store a fact verbatim. Returns the nearest existing memories so you can reconcile. |
| update_memory(id, text) | Replace/merge an existing memory (avoid duplicates). |
| delete_memory(id) | Remove an outdated or contradicted memory. |
| search_memories(query, user_id?) | Semantic search; returns memories with IDs (📌 marks pinned/core). |
| list_memories(user_id?) | List everything stored (with IDs; 📌 marks pinned/core). |
| pin_memory(id) | Pin a memory into always-on core (mirrored to a file your rules load every session). Bounded by MEM0_CORE_BUDGET. |
| unpin_memory(id) | Remove from core; the memory stays stored and searchable. |
Prompt & resources (for clients that surface them) make recall low-friction — no need for the agent to remember to search:
| Kind | Name | What it does |
|---|---|---|
| Prompt | load_context(query?) | Pull relevant memories into the conversation as context. Invoke at the start of a task so the agent recalls instead of re-asking. No query = list all. |
| Prompt | curate_memories() | Maintenance pass: full inventory + usage stats, with instructions for the agent to merge duplicates, drop stale facts, rewrite, and re-balance core. |
| Resource | memory://all | All stored memories (with IDs). |
| Resource | memory://core | The pinned always-on core set. |
| Resource | memory://search/{query} | Hybrid-ranked memories for query. |
Getting agents to use memory proactively
Storage is half the problem; the other half is getting agents to recall before asking and save without being told — so you never repeat yourself and tokens aren't burned re-explaining. Three layers push for that:
- Server instructions (built in). Sent to every client in the MCP initialize response; most clients inject them into the agent's system prompt: search memory at task start and before asking the user anything, save durable facts the moment they appear, reconcile instead of duplicating, never store secrets. Both the backend and the proxy declare them (a FastMCP proxy answers initialize itself); see server/mem0_instructions.py.
- When-to-call tool descriptions (built in). search_memories and add_memory carry explicit triggers, so even an agent that reads only the tool schema knows when to fire them.
- A rules-file snippet (recommended). Clients differ in whether they surface server instructions, so for maximum reliability also paste this into the agent's always-on rules (AGENTS.md, CLAUDE.md, .cursorrules, Kiro steering, ...):

    ## Long-term memory (local-mem0-mcp)
    You have persistent memory shared with the user's other LLM clients/agents. Use it without being asked:
    - Task start: call search_memories with the task's key terms.
    - Before asking the user anything: search_memories first; the answer may already be stored.
    - On learning a durable fact (decision, preference, config, path, environment quirk): call add_memory immediately, one atomic fact per call.
    - Reconcile, don't duplicate: update_memory to refine/merge; delete_memory when a memory becomes wrong.
    - Never store secrets (passwords, API keys, tokens).
Core memory (always-on)
Retrieval has one structural gap: the agent has to decide to search. Core
memory closes it. Pin the handful of must-not-forget facts — project identity,
key paths, environment, core preferences — and they're mirrored to a plain file,
~/.mem0-mcp/CORE_MEMORY.md, that your always-on rules load every session.
Those facts reach the agent with no tool call and no retrieval luck.
- Pin / unpin. pin_memory(id) adds a memory to core; unpin_memory(id) removes it. Either way the memory stays stored and searchable; pinned entries show 📌 in search_memories / list_memories.
- Bounded by design. Core is capped at MEM0_CORE_BUDGET characters (default 4000). It loads into every session, so the cap keeps that always-on block small: pinning past it is refused until you unpin or shorten.
- Activate it once. Add a line to your always-on rules file so the agent reads the mirror at the start of every session:

    ## Core memory (always-on)
    At the START of every session, read ~/.mem0-mcp/CORE_MEMORY.md (the user's pinned, always-on core memory).
    (Claude Code: import it with `@~/.mem0-mcp/CORE_MEMORY.md`.)
The mirror file is auto-generated (re-synced on every pin/unpin and at backend
startup) — never edit it by hand. Core is also exposed as the memory://core
resource and shown at the top of load_context.
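The budget rule amounts to a character count over everything pinned. A minimal sketch, assuming the budget is measured as the summed length of the pinned texts and that landing exactly on the cap is allowed (the server may count differently, for example including formatting in the mirror file); `can_pin` is a hypothetical name:

```python
def can_pin(pinned_texts, candidate, budget=4000):
    """Return True if pinning `candidate` keeps core within `budget` characters.

    `pinned_texts` is the list of memory texts that are already pinned.
    """
    used = sum(len(text) for text in pinned_texts)
    return used + len(candidate) <= budget


if __name__ == "__main__":
    core = ["Project root is ~/work/app", "Use pnpm, never npm"]
    print(can_pin(core, "Deploy target is staging-2"))  # True: far below 4000 chars
    print(can_pin(["x" * 3990], "y" * 11))              # False: 4001 > 4000
```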
Keeping memory tidy (curation)
Every search quietly records lightweight usage stats — retrieval count and
last-used date — per memory. The curate_memories prompt turns those into a
maintenance pass: it lays out the full inventory (📌 pinned, created date, usage)
and asks the agent to merge duplicates, drop stale facts, tighten wording, and
re-balance what deserves an always-on core slot — one tool call at a time. Run it
periodically or whenever memory feels noisy. (Low usage alone is never a reason
to delete: durable facts stay.)
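For intuition, per-memory usage tracking can be as small as a JSON sidecar updated on every search. The sketch below is illustrative only: the schema (`hits`, `last_used`) and the function name `record_hits` are assumptions, not the actual layout of ~/.mem0-mcp/memory_meta.json.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def record_hits(meta_path, memory_ids, today=None):
    """Bump the retrieval count and last-used date for each returned memory."""
    stamp = (today or date.today()).isoformat()
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    for memory_id in memory_ids:
        entry = meta.setdefault(memory_id, {"hits": 0, "last_used": None})
        entry["hits"] += 1
        entry["last_used"] = stamp
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta


if __name__ == "__main__":
    # A throwaway directory stands in for ~/.mem0-mcp in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "memory_meta.json"
        record_hits(path, ["a", "b"])
        print(record_hits(path, ["a"]))
```

A curation pass can then sort by `hits` and `last_used` to surface candidates for review, keeping in mind that low usage alone is not a reason to delete.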
How memory works (the client is the brain)
Mem0's value is "smart memory": pull out the durable facts, then add / update / delete so memory stays deduplicated and consistent. That normally needs an LLM — but your MCP client is one, so it does the reasoning and drives these tools:
1. Extract the atomic facts worth keeping from the conversation.
2. search_memories for related / duplicate / contradicting entries.
3. Reconcile: add_memory (new) · update_memory (refine/merge) · delete_memory (obsolete).
To make step 3 easy, add_memory also returns the nearest existing memories.
Under the hood the server uses mem0's infer=False path — embed and store
verbatim — so writes are instant and deterministic, with no model call.
Retrieval & tuning
Search is hybrid by default: dense vector similarity (semantic) fused with a
local BM25 lexical signal, so both paraphrases and exact identifiers (file paths,
env-var names, IPs, function names) surface. Fusion defaults to rescue — it
keeps the dense ranking and only adds exact matches the vector model missed, so
it never reorders good dense results (provably non-regressing; its payoff grows as
the store gets larger). An aggressive Reciprocal Rank Fusion is available via
MEM0_FUSION=rrf (it can reorder dense results — measure first). Turn hybrid off
with MEM0_HYBRID_SEARCH=0. No extra dependency; all local and deterministic.
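The difference between the two fusion modes is easiest to see on ranked lists of IDs. The sketch below is a simplified reading of the description above, not the server's code: `rescue_fuse` assumes rescued lexical hits are appended after the dense results, and `rrf_fuse` is the textbook Reciprocal Rank Fusion with constant k (MEM0_RRF_K defaults to 60).

```python
def rescue_fuse(dense, lexical):
    """Keep the dense order untouched; append lexical hits dense missed."""
    seen = set(dense)
    fused = list(dense)
    for doc_id in lexical:
        if doc_id not in seen:
            seen.add(doc_id)
            fused.append(doc_id)
    return fused


def rrf_fuse(dense, lexical, k=60):
    """Score each ID by the sum of 1 / (k + rank) over both rankings."""
    scores = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # sorted() is stable, so ties keep first-seen (dense-first) order.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    dense = ["m1", "m2", "m3"]
    lexical = ["m3", "m9"]  # m9 is an exact match the vector model missed
    print(rescue_fuse(dense, lexical))  # ['m1', 'm2', 'm3', 'm9']
    print(rrf_fuse(dense, lexical))     # ['m3', 'm1', 'm2', 'm9']
```

In this example rescue leaves m1, m2, m3 exactly where dense put them and only adds m9, while RRF promotes m3 above m1 because both rankings contain it. That reordering is the reason to measure before switching to rrf.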
Measure before you tune. server/eval_recall.py builds a throwaway store
with a labeled corpus and reports hit@k / MRR for dense vs hybrid (it never touches
your real store or the backend):
.venv/bin/python server/eval_recall.py
EVAL_VERBOSE=1 .venv/bin/python server/eval_recall.py # per-query first-hit ranks
Trying a different embedder. Swapping MEM0_EMBEDDER_MODEL on a populated
store breaks ranking (old vectors were produced by the old model). Compare
candidates with the harness, then re-embed safely (backs up first; stop the backend
first):
MEM0_EMBEDDER_MODEL=intfloat/multilingual-e5-small MEM0_EMBEDDER_DIMS=384 \
.venv/bin/python server/migrate_reembed.py
Good local, multilingual-friendly options for a bilingual store (both 384 dims):
intfloat/multilingual-e5-small and
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
Lifecycle (auto start / stop)
- Your IDE/CLI launches → it spawns server/mem0_proxy.py (stdio) as a child.
- The proxy runs launchctl kickstart to start the shared backend if it isn't already up, then forwards tool calls and sends a periodic keepalive.
- You close the client → the proxy dies → with nothing keeping it warm, the backend idle-exits after MEM0_IDLE_TIMEOUT seconds and frees its RAM. (It waits for any in-flight memory operation to finish first, so a write is never cut off mid-flight.)
- Open any client again → the proxy starts the backend again.
Every proxy forwards to the same backend, so there is exactly one Chroma writer even with several clients open at once.
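The idle-exit rule described above (exit only after MEM0_IDLE_TIMEOUT seconds of silence, and never while an operation is in flight) can be modelled with a small tracker. This is an illustrative sketch: `IdleTracker` is a hypothetical class, and it is not thread-safe as written, so a real backend would guard it with a lock.

```python
import time


class IdleTracker:
    """Decide when an idle backend may exit."""

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self._clock = clock
        self._last_activity = clock()
        self._in_flight = 0

    def touch(self):
        """Record activity, e.g. a keepalive ping from a proxy."""
        self._last_activity = self._clock()

    def begin(self):
        """Mark the start of a memory operation."""
        self._in_flight += 1
        self.touch()

    def end(self):
        """Mark the end of a memory operation."""
        self._in_flight -= 1
        self.touch()

    def should_exit(self):
        if self.idle_timeout <= 0:   # 0 disables idle-exit
            return False
        if self._in_flight > 0:      # never cut off a running write
            return False
        return self._clock() - self._last_activity >= self.idle_timeout


if __name__ == "__main__":
    tracker = IdleTracker(idle_timeout=600)
    tracker.begin()
    tracker.end()
    print(tracker.should_exit())  # False: activity was just recorded
```

A background loop would call `should_exit()` periodically and shut the process down once it returns True.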
Configuration
Backend (server/mem0_mcp_server.py; set in
launchd/com.mem0mcp.server.plist.template, then re-run install.sh, or pass to
install.sh):
| Var | Default | Notes |
|---|---|---|
| MEM0_IDLE_TIMEOUT | 600 | seconds of inactivity before the backend exits; 0 disables |
| MEM0_EMBEDDER_MODEL | sentence-transformers/all-MiniLM-L6-v2 | local embedder |
| MEM0_EMBEDDER_DIMS | 384 | must match the model |
| MEM0_CHROMA_PATH | ~/.mem0-mcp/chroma | vector store location |
| MEM0_COLLECTION | mem0 | Chroma collection name |
| MEM0_DEFAULT_USER | developer_workspace | default user_id |
| MEM0_RELATED_TOPK | 3 | nearest memories add_memory surfaces |
| MEM0_SEARCH_TOPK | 10 | results search_memories returns |
| MEM0_CORE_BUDGET | 4000 | max total chars of pinned (core) memories; pinning past it is refused |
| MEM0_CORE_FILE | ~/.mem0-mcp/CORE_MEMORY.md | always-on core mirror file (rules files read this) |
| MEM0_META_FILE | ~/.mem0-mcp/memory_meta.json | sidecar: pin state + per-memory usage stats |
| MEM0_HYBRID_SEARCH | 1 | hybrid dense+lexical retrieval; 0 = dense only |
| MEM0_FUSION | rescue | rescue (non-regressing) or rrf (aggressive) |
| MEM0_RRF_K | 60 | RRF constant (used only when MEM0_FUSION=rrf) |
| MEM0_BM25_MAX_DOCS | 5000 | cap on lexical scan size for very large stores |
| MEM0_MCP_PORT | 8765 | backend HTTP port (must match the proxy) |
Proxy (server/mem0_proxy.py; set via the env block of your MCP config):
| Var | Default | Notes |
|---|---|---|
| MEM0_MCP_PORT | 8765 | backend port to reach / kickstart |
| MEM0_SERVER_LABEL | com.mem0mcp.server | launchd label to start on demand |
| MEM0_PROXY_KEEPALIVE | clamp(IDLE/3, 5, 120) | seconds between keepalive pings |
| MEM0_BACKEND_READY_TIMEOUT | 40 | seconds to wait for the backend to come up |
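The MEM0_PROXY_KEEPALIVE default, clamp(IDLE/3, 5, 120), works out as follows. This is a sketch: `default_keepalive` is a hypothetical name, and integer division is an assumption (the proxy may use a float).

```python
def default_keepalive(idle_timeout):
    """clamp(idle_timeout / 3, 5, 120), in whole seconds."""
    return max(5, min(120, idle_timeout // 3))


if __name__ == "__main__":
    print(default_keepalive(600))  # 120: 600/3 = 200, capped at 120
    print(default_keepalive(90))   # 30
    print(default_keepalive(9))    # 5: 9/3 = 3, raised to the floor of 5
```

With the default 600-second idle timeout, a ping every 120 seconds gives the backend several keepalives per idle window, so one missed ping does not let it exit under an open client.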
Why this design
- The client is the intelligence. Running a second local LLM just to re-extract facts was the biggest source of friction (had to be running, had to be a non-reasoning instruct model, slow). Since the calling agent is already an LLM, we drop that entirely and use mem0's verbatim-store path. (mem0 still constructs an LLM client internally; it is wired so it is never contacted.)
- One shared HTTP backend. Plain MCP stdio spawns a separate server per client — multiple clients would open the same Chroma store with multiple writers (lock/corruption risk) and can orphan into zombie processes. A single shared backend gives one writer and no duplicates. Inside that backend a single global lock serializes every memory operation (reads and writes), so concurrent calls from multiple clients can never interleave or corrupt the store — they queue and run one at a time. An OS-level file lock on the store directory hard-enforces the single writer: a second backend pointed at the same store refuses to start rather than risk corruption. (Data-loss safety is prioritized over throughput here; memory ops are fast and infrequent, so the serialization is imperceptible.)
- A per-client stdio proxy for lifecycle. The proxy is lightweight (no embedder/Chroma) and its lifetime tracks the client, so the backend can start on launch and stop on close — the on-demand behaviour a bare HTTP URL can't provide.
- Idle auto-exit frees RAM. The backend holds ~200 MB; it exits once MEM0_IDLE_TIMEOUT seconds (default 600) pass with no client keeping it warm, and restarts on the next launch.
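The two safeguards on the shared backend (a process-wide lock that serializes operations, plus an OS-level lock file that rejects a second writer) can be sketched as follows. The names `serialized` and `acquire_writer_lock` are hypothetical, and `fcntl.flock` is an assumed mechanism; the project may use a different locking call.

```python
import fcntl
import functools
import os
import tempfile
import threading

_OP_LOCK = threading.Lock()  # one memory operation at a time


def serialized(fn):
    """Run the wrapped operation under the global lock."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with _OP_LOCK:
            return fn(*args, **kwargs)
    return wrapper


def acquire_writer_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle  # keep it open for the life of the process


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, ".writer.lock")
        first = acquire_writer_lock(path)
        second = acquire_writer_lock(path)
        print(first is not None, second is None)  # True True
        first.close()
```

Closing the file (or exiting the process) releases the lock, which is why a crashed backend does not leave the store permanently locked under this scheme.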
FAQ
What happened to the menu-bar toggle (and the old name)?
Early versions shipped a menu-bar on/off switch and were named mem0-mcp-toggle.
The toggle was replaced by the automatic lifecycle above, and the project was
renamed to local-mem0-mcp.
Does it need an LLM or API key? No. Only a local embedder, which downloads once and then runs offline.
What's "core memory"? Regular memories surface only when searched; pinned
core memories load into every session via ~/.mem0-mcp/CORE_MEMORY.md (see
Core memory). Use pin_memory for the few facts you
always want in context.
Where is my data? ~/.mem0-mcp/chroma (vectors), plus
~/.mem0-mcp/CORE_MEMORY.md (pinned-core mirror) and
~/.mem0-mcp/memory_meta.json (pin state + usage stats). Uninstalling keeps them.
Can I run several clients at once? Yes — they all share the one backend (single Chroma writer).
Troubleshooting
- Tools missing / client can't connect → check the command/args paths in your MCP config point at this repo's .venv/bin/python3 and server/mem0_proxy.py. The proxy logs to stderr (visible in your client's MCP logs).
- Backend won't start → confirm the agent is registered: launchctl print gui/$(id -u)/com.mem0mcp.server. Check ~/Library/Logs/mem0-mcp.log. Start it manually with launchctl kickstart gui/$(id -u)/com.mem0mcp.server.
- Log says "refusing to start a second Chroma writer" → expected, not a bug: another backend already holds the store's single-writer lock (~/.mem0-mcp/chroma/.writer.lock). Only one backend may write at a time. Use the one that's already up, or stop it first (launchctl kill TERM gui/$(id -u)/com.mem0mcp.server) before starting another. (During a normal restart the new backend briefly retries while the old one exits, so this only persists if a backend is genuinely still running.)
- First write is slow / needs internet → the embedder downloads once, then runs offline.
- Search feels off on an older store → stores created before the cosine upgrade use Chroma's default L2 distance; with the backend stopped, run .venv/bin/python server/migrate_cosine.py to switch to cosine (reuses embeddings, backs up first). New installs already use cosine.
- Free RAM right now → close your clients (it idle-exits), or launchctl kill TERM gui/$(id -u)/com.mem0mcp.server.
- Only runs while logged in: it's a LaunchAgent (per-user GUI session), not a boot daemon.
- Logs: ~/Library/Logs/mem0-mcp.log.
Uninstall
./uninstall.sh
Removes the launchd backend agent (and any legacy menu-bar toggle). Keeps your
stored memories (~/.mem0-mcp/chroma) and the venv.
License
MIT — see LICENSE. Built on mem0ai/mem0, FastMCP, Chroma, and sentence-transformers; each retains its own license.