qwen-memory-agent
MCP-native persistent-memory agent that remembers user preferences across sessions, forgets superseded facts, and recalls relevant memories within a tight token budget.
A benchmarked, MCP-native persistent-memory agent built on Qwen Cloud (Alibaba Cloud / DashScope). Submitted to the Qwen Cloud Hackathon, Track 1 — MemoryAgent.
The agent itself decides, via Qwen function-calling, when to remember, recall, or forget. It carries user preferences across sessions, forgets superseded facts, recalls the right memories inside a tight token budget, and backs each claim with benchmark numbers against naive baselines.
Why it's different
Most memory agents are "stuff everything into RAG and hope." This one treats memory as a measurable engineering problem, and every capability maps to a Track-1 requirement:
- Agentic memory via Qwen function-calling: the model invokes `remember` / `recall` / `forget` tools through a real agent loop. It's an agent with memory, not a database with an LLM bolted on.
- Supersession-aware forgetting (exact and semantic): when a new fact contradicts an old one, the old record is retired. Exact `(subject, type)` match handles the clean case; a cosine-similarity pass (configurable `SUPERSEDE_THRESHOLD`) also retires near-paraphrases the model filed under a different subject, the case that defeats exact matching in a real agent loop.
- Graded, time-based decay + reinforce-on-recall: `effective_salience = salience · 0.5^(age / half_life)` (per-type half-lives; `preference` pinned). Recalling a memory refreshes it (`access_count`, `last_accessed`), so hot memories stay and cold ones fade: "timely forgetting of outdated information."
- Typed retrieval, a second self-correcting layer: a type-aware ranking prior (a durable `preference` outranks a throwaway `episodic` note of equal cosine) plus a retrieval-time "one-active-per-(subject, type), keep-newest" veto that catches stale contradictions the write path can miss (e.g. records that arrive via import). "Recall the most critical memories under limited context."
- Budget-constrained recall: retrieval scores memories by `α·cosine + β·recency + γ·effective_salience + δ·type_prior` and greedily packs them until a configurable token budget is hit, so context stays small and relevant.
- Portable memory (export / import): the whole store round-trips as JSON (vectors preserved, no re-embedding) or renders to Markdown, so memory moves across sessions and machines.
- Persistent across restarts: set `MEMORY_PERSIST_PATH` and the store writes an atomic JSON snapshot on every change and reloads it on startup (rebuilding the vector index), so memories survive a full server restart. This is real persistence, not process-lifetime state.
- The dreaming loop (propose → approve): an out-of-band Qwen pass reviews the store and proposes consolidations (merge / forget / re-salience); a human approves, then only approved proposals are applied. It validates every proposal against live record ids, so it refuses to act on its own hallucinations. "Autonomously accumulate experience", with a human in the loop.
- Token & model observability: every Qwen call's `usage` (prompt / completion / total tokens, per model) is accumulated and exposed at `/usage`; `/chat` reports the per-request token delta.
- A reproducible benchmark: synthetic multi-session personas, a held-out query set, and baselines (no-memory / full-history / naive-RAG / ours), scored on context recall (retrieval-level, model-free), staleness rate, and a context-efficiency curve.
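The decay formula, the weighted score, and the greedy packing described in the list above can be sketched together. This is a minimal illustration, not the project's code: the half-life values, the weights, the type priors, the `1 / (1 + age_days)` recency term, the whitespace token counter (standing in for tiktoken), and measuring age from `last_accessed` are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical per-type half-lives in days; None means pinned (never decays).
HALF_LIFE_DAYS = {"preference": None, "fact": 30.0, "episodic": 7.0}
# Hypothetical type priors and score weights (alpha, beta, gamma, delta).
TYPE_PRIOR = {"preference": 1.0, "fact": 0.6, "episodic": 0.3}
ALPHA, BETA, GAMMA, DELTA = 0.6, 0.1, 0.2, 0.1
SECONDS_PER_DAY = 86400.0


@dataclass
class Memory:
    text: str
    type: str
    salience: float
    last_accessed: float  # seconds since epoch
    access_count: int = 0


def age_days(m: Memory, now: float) -> float:
    # Assumption: age is measured from the last access, so recall resets it.
    return max(0.0, now - m.last_accessed) / SECONDS_PER_DAY


def effective_salience(m: Memory, now: float) -> float:
    """salience * 0.5 ** (age / half_life); pinned types keep full salience."""
    half_life = HALF_LIFE_DAYS.get(m.type)
    if half_life is None:
        return m.salience
    return m.salience * 0.5 ** (age_days(m, now) / half_life)


def score(m: Memory, cosine: float, now: float) -> float:
    """alpha*cosine + beta*recency + gamma*effective_salience + delta*type_prior."""
    recency = 1.0 / (1.0 + age_days(m, now))  # assumed recency shape
    return (ALPHA * cosine + BETA * recency
            + GAMMA * effective_salience(m, now)
            + DELTA * TYPE_PRIOR.get(m.type, 0.0))


def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer


def recall(candidates: list, budget: int, now: float) -> list:
    """Greedily pack the best-scoring (memory, cosine) pairs into the budget."""
    ranked = sorted(candidates, key=lambda c: score(c[0], c[1], now), reverse=True)
    packed, used = [], 0
    for memory, _cosine in ranked:
        cost = count_tokens(memory.text)
        if used + cost > budget:
            continue  # assumption: skip what does not fit, keep trying smaller ones
        packed.append(memory)
        used += cost
        # Reinforce on recall: refresh the access metadata.
        memory.access_count += 1
        memory.last_accessed = now
    return packed
```

With these illustrative numbers, an `episodic` memory last touched one half-life ago keeps half its salience, while a `preference` of any age keeps all of it.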
Architecture
```mermaid
flowchart TB
    U["MCP client / demo UI"]
    subgraph ecs["Alibaba Cloud ECS (Singapore)"]
        API["FastAPI backend<br/>/chat · /health · /usage<br/>/memory/export · /memory/import<br/>/dream · /dream/apply"]
        AGENT["MemoryAgent loop<br/>Qwen function-calling"]
        DREAM["Dreaming loop<br/>propose → approve consolidation"]
        MCP["FastMCP server<br/>remember / recall / forget / stats<br/>export / import / dream / dream_apply"]
        ENG["Memory Engine<br/>write · retrieve · exact + semantic supersession<br/>typed retrieval · decay + reinforce · dreaming loop<br/>token-budget packing"]
        QD[("Qdrant<br/>embedded vector store")]
        SNAP[("Disk snapshot<br/>memory.json · survives restart")]
    end
    DS["Qwen Cloud / DashScope-intl<br/>reasoning model + text-embedding-v3<br/>(usage metered per call)"]
    U -->|HTTP| API
    U -.->|MCP| MCP
    API --> AGENT
    API --> DREAM
    AGENT -->|"decides which tool to call"| ENG
    DREAM -->|"proposes / applies"| ENG
    MCP --> ENG
    AGENT <-->|"chat + tool specs"| DS
    DREAM <-->|"review memories"| DS
    ENG <-->|"embed"| DS
    ENG <--> QD
    ENG <-->|"save on write / load on start"| SNAP
```
The agent loop (/chat) lets Qwen choose tool calls; the same memory engine is also exposed directly over MCP for any MCP client, and the dreaming loop drives it as a maintenance pass. With MEMORY_PERSIST_PATH set, the engine snapshots to disk on every change and rehydrates on startup, so the store survives a restart. The Qwen client has bounded retry/backoff for resilience and meters token usage on every call.
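One common way to get an atomic snapshot like the one described above is to write to a temporary file in the same directory and then rename it over the target. The sketch below assumes that approach and a flat list-of-dicts record shape; the function names `save_snapshot` and `load_snapshot` are hypothetical and the project's real persistence code may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def save_snapshot(records: list, path: str) -> None:
    """Write records as JSON so readers never observe a half-written file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(records, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)  # atomic swap of old snapshot for new
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_snapshot(path: str) -> list:
    """Return the stored records, or an empty store if no snapshot exists yet."""
    target = Path(path)
    if not target.exists():
        return []
    with target.open(encoding="utf-8") as fh:
        return json.load(fh)
```

Because the rename replaces the old file in a single step, a crash mid-write leaves the previous snapshot intact, which is the property that lets a restart rehydrate a consistent store.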
HTTP + MCP surface
| HTTP route | MCP tool(s) | Purpose |
|---|---|---|
| `POST /chat` | `memory.remember` / `recall` / `forget` | agent loop; Qwen picks memory tools |
| `GET /usage` | (none) | accumulated token usage (per model) |
| `GET /memory/export` · `POST /memory/import` | `memory.export` / `memory.import` | round-trip the store (JSON + Markdown) |
| `POST /dream` · `POST /dream/apply` | `memory.dream` / `memory.dream_apply` | propose consolidations, then apply approved ones |
| `GET /health` | `memory.stats` | liveness / store counts |
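To make "Qwen picks memory tools" concrete, here is a rough sketch of tool specs in the OpenAI-compatible function-calling format plus a dispatcher that routes a chosen tool call to an engine. Only the tool names `remember`, `recall`, and `forget` come from this README; the parameter names, the `ToyEngine` stand-in, and the `dispatch` helper are hypothetical.

```python
import json

# Tool specs in the OpenAI-compatible function-calling shape.
# Parameter names are assumptions made for this sketch.
TOOL_SPECS = [
    {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string"} for p in params},
                "required": list(params),
            },
        },
    }
    for name, description, params in [
        ("remember", "Store a fact about the user.", ("subject", "text")),
        ("recall", "Retrieve memories relevant to a query.", ("query",)),
        ("forget", "Retire the memory stored under a subject.", ("subject",)),
    ]
]


class ToyEngine:
    """Hypothetical stand-in for the memory engine: one text per subject."""

    def __init__(self):
        self.store = {}

    def remember(self, subject, text):
        self.store[subject.lower()] = text
        return "stored"

    def recall(self, query):
        q = query.lower()
        return [text for subject, text in self.store.items() if subject in q]

    def forget(self, subject):
        return self.store.pop(subject.lower(), None) is not None


def dispatch(engine, name, arguments_json):
    """Run the tool the model chose and return a JSON string for the tool message."""
    handlers = {
        "remember": engine.remember,
        "recall": engine.recall,
        "forget": engine.forget,
    }
    handler = handlers.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = handler(**json.loads(arguments_json))
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": f"bad arguments: {exc}"})
    return json.dumps({"result": result})
```

Returning errors as JSON instead of raising keeps the loop alive when the model names a tool that does not exist or passes malformed arguments, so it can correct itself on the next turn.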
Stack
Python · FastAPI · Qwen function-calling agent loop · FastMCP · openai SDK → DashScope-intl · Qwen text-embedding-v3 · Qdrant · tiktoken (budget accounting).
Quickstart
```bash
uv sync
cp .env.example .env   # set DASHSCOPE_API_KEY + DASHSCOPE_BASE_URL
PYTHONPATH=src uv run --no-sync pytest -q tests/   # fully mocked, zero Qwen credit spend
```
Benchmark results
The benchmark is reproducible and fully offline: `PYTHONPATH=src uv run --no-sync python -m benchmark.run`
uses a deterministic bag-of-vocabulary embedder, so the harness measures the memory engine's
ranking and supersession logic (not embedding noise) and costs zero Qwen credits. All three
systems compete under the same shrinking token budget, so this is a fair context-efficiency test.
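A deterministic bag-of-vocabulary embedder can be as small as a fixed word list and a normalized count vector. The sketch below is an assumption about what such an embedder might look like; the class name, tokenizer, and vocabulary are illustrative and not taken from the benchmark code.

```python
import math
import re


class BagOfVocabEmbedder:
    """Maps text to a unit-length count vector over a fixed vocabulary."""

    def __init__(self, vocabulary):
        # Sorting makes the dimension order independent of input order.
        self.index = {word: i for i, word in enumerate(sorted(set(vocabulary)))}

    def embed(self, text):
        vec = [0.0] * len(self.index)
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            i = self.index.get(token)
            if i is not None:  # out-of-vocabulary words are ignored
                vec[i] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    # Vectors are already unit length (or all zero), so the dot product suffices.
    return sum(x * y for x, y in zip(a, b))
```

The same text always produces the same vector with no network call, so any change in benchmark scores can be attributed to the ranking and supersession logic.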

Context recall (retrieval-level, model-free) and staleness rate (fraction of retrieved contexts
containing a retired fact; lower is better) vs the memory token budget, over the six-persona,
24-query synthetic set in benchmark/generate.py. Token budgets use tiktoken's
gpt-4o-mini encoding as a consistent approximation for Qwen context accounting.
| Budget (tokens) | 8 | 16 | 32 | 64 |
|---|---|---|---|---|
| B1 full-history — context recall / staleness | 0.000 / 0.250 | 0.375 / 0.250 | 0.958 / 0.250 | 1.000 / 0.250 |
| B2 naive top-k — context recall / staleness | 0.875 / 0.125 | 1.000 / 0.250 | 1.000 / 0.250 | 1.000 / 0.250 |
| B3 ours — context recall / staleness | 1.000 / 0.000 | 1.000 / 0.000 | 1.000 / 0.000 | 1.000 / 0.000 |
B3 holds context recall at 1.000 and staleness at 0.000 at every budget; it is the only system that recalls the current facts and never re-surfaces retired ones. The two baselines fail in different ways:
- B1 (dump history chronologically) wastes its budget on the oldest facts, so it needs a large budget just to recall the current answer — and it permanently carries the stale one.
- B2 (keyword top-k) gets staler as the budget grows: with no notion of "replaced," extra budget pulls retired facts back in, so its staleness climbs 0.125 → 0.250 and then plateaus.
Only supersession-aware forgetting + budget-constrained recall keeps the working set both correct and small.
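The two metrics in the table can be computed without calling a model. The sketch below assumes context recall means "the current fact appears in the retrieved context" and uses the staleness definition given above; the `QueryResult` shape and exact string matching are illustrative assumptions, not the harness's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class QueryResult:
    retrieved: list  # memory texts packed into the context for one query
    expected: str    # the current (non-retired) fact that answers the query
    retired: list = field(default_factory=list)  # superseded facts for this subject


def context_recall(results):
    """Fraction of queries whose retrieved context contains the current fact."""
    if not results:
        return 0.0
    return sum(r.expected in r.retrieved for r in results) / len(results)


def staleness_rate(results):
    """Fraction of retrieved contexts containing at least one retired fact."""
    if not results:
        return 0.0
    stale = sum(any(f in r.retrieved for f in r.retired) for r in results)
    return stale / len(results)
```

Under this reading, a 24-query set yields values in steps of 1/24, which is consistent with the reported figures (for example 0.958 ≈ 23/24 and 0.250 = 6/24).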
The semantic supersession threshold is also checked against live DashScope `text-embedding-v3`
embeddings in `docs/embedding-validation.md`. That run did not validate the default cleanly:
supersession-pair cosines were 0.879-0.908, while unrelated distractors were 0.683-0.743. At the
default `SUPERSEDE_THRESHOLD=0.9`, no distractor comes close to being retired, but supersession
pairs at the low end of that range fall below the cutoff and would be missed by the semantic pass.
The default is therefore conservative and should be revisited with a larger set rather than
treated as a proven universal constant.
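The trade-off behind that threshold can be checked with a few lines of arithmetic. The sketch below uses only the range endpoints quoted above (the full per-pair data is not reproduced here) and assumes a record is retired when its cosine is at or above the threshold; `evaluate_threshold` is a hypothetical helper.

```python
# Range endpoints reported in the validation run; individual pair values are not shown here.
SUPERSESSION_PAIR_COSINES = (0.879, 0.908)
DISTRACTOR_COSINES = (0.683, 0.743)


def evaluate_threshold(threshold, pair_cosines, distractor_cosines):
    """Report which true pairs a threshold misses and which distractors it retires.

    Assumption: a record is retired when cosine >= threshold.
    """
    return {
        "missed_pairs": [c for c in pair_cosines if c < threshold],
        "false_retirements": [c for c in distractor_cosines if c >= threshold],
    }


def separation_gap(pair_cosines, distractor_cosines):
    """Distance between the weakest true pair and the strongest distractor."""
    return min(pair_cosines) - max(distractor_cosines)


report = evaluate_threshold(0.9, SUPERSESSION_PAIR_COSINES, DISTRACTOR_COSINES)
gap = separation_gap(SUPERSESSION_PAIR_COSINES, DISTRACTOR_COSINES)
```

On these endpoints the default misses the 0.879 pair and retires no distractor, and the two groups are separated by a gap of roughly 0.136. A handful of values is too small a sample to pick a new default from, which is why a larger validation set is the suggested next step.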
License
MIT — see LICENSE.