qwen-memory-agent

MCP-native persistent-memory agent that remembers user preferences across sessions, forgets superseded facts, and recalls relevant memories within a tight token budget.

A benchmarked, MCP-native persistent-memory agent built on Qwen Cloud (Alibaba Cloud / DashScope). Submitted to the Qwen Cloud Hackathon, Track 1 — MemoryAgent.

The agent itself decides, via Qwen function-calling, when to remember, recall, or forget. It carries user preferences across sessions, forgets superseded facts, and recalls the right memories inside a tight token budget, and it backs those claims with benchmark numbers against naive baselines.

Why it's different

Most memory agents are "stuff everything into RAG and hope." This one treats memory as a measurable engineering problem, and every capability maps to a Track-1 requirement:

  • Agentic memory via Qwen function-calling — the model invokes remember / recall / forget tools through a real agent loop. It's an agent with memory, not a database with an LLM bolted on.
  • Supersession-aware forgetting (exact and semantic) — when a new fact contradicts an old one, the old record is retired. Exact (subject, type) match handles the clean case; a cosine-similarity pass (configurable SUPERSEDE_THRESHOLD) also retires near-paraphrases the model filed under a different subject — the case that defeats exact matching in a real agent loop.
  • Graded, time-based decay + reinforce-on-recall: effective_salience = salience · 0.5^(age / half_life) (per-type half-lives; preference pinned). Recalling a memory refreshes it (access_count, last_accessed), so hot memories stay and cold ones fade: "timely forgetting of outdated information."
  • Typed retrieval, a second self-correcting layer: a type-aware ranking prior (a durable preference outranks a throwaway episodic note of equal cosine) plus a retrieval-time veto that keeps one active record per (subject, type), the newest, and so catches stale contradictions the write path can miss (e.g. records that arrive via import). "Recall the most critical memories under limited context."
  • Budget-constrained recall — retrieval scores memories by α·cosine + β·recency + γ·effective_salience + δ·type_prior and greedily packs them until a configurable token budget is hit, so context stays small and relevant.
  • Portable memory (export / import) — the whole store round-trips as JSON (vectors preserved, no re-embedding) or renders to Markdown, so memory moves across sessions and machines.
  • Persistent across restarts — set MEMORY_PERSIST_PATH and the store writes an atomic JSON snapshot on every change and reloads it on startup (rebuilding the vector index), so memories survive a full server restart — real persistence, not process-lifetime state.
  • The dreaming loop (propose → approve) — an out-of-band Qwen pass reviews the store and proposes consolidations (merge / forget / re-salience); a human approves, then only approved proposals are applied. It validates every proposal against live record ids, so it refuses to act on its own hallucinations. "Autonomously accumulate experience" — with a human in the loop.
  • Token & model observability — every Qwen call's usage (prompt / completion / total tokens, per model) is accumulated and exposed at /usage; /chat reports the per-request token delta.
  • A reproducible benchmark — synthetic multi-session personas, a held-out query set, and baselines (no-memory / full-history / naive-RAG / ours), scored on context recall (retrieval-level, model-free), staleness rate, and a context-efficiency curve.
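
The decay formula and the budget-constrained scoring described in the list above can be sketched in a few lines. This is a minimal illustration rather than the project's code: the half-life values, type priors, weights, the recency shape, and the whitespace token counter (a stand-in for tiktoken) are all assumptions. Only the shape of the two formulas comes from the README.

```python
from dataclasses import dataclass

# Hypothetical per-type half-lives in days; None means the type is pinned (no decay).
HALF_LIFE_DAYS = {"preference": None, "fact": 30.0, "episodic": 7.0}
# Hypothetical type priors and weights; the README names the terms, not their values.
TYPE_PRIOR = {"preference": 1.0, "fact": 0.6, "episodic": 0.3}
ALPHA, BETA, GAMMA, DELTA = 0.6, 0.1, 0.2, 0.1


@dataclass
class Memory:
    text: str
    mem_type: str
    salience: float
    age_days: float
    cosine: float  # similarity to the current query, computed elsewhere


def effective_salience(m: Memory) -> float:
    """salience * 0.5^(age / half_life); pinned types keep their salience."""
    half_life = HALF_LIFE_DAYS[m.mem_type]
    if half_life is None:
        return m.salience
    return m.salience * 0.5 ** (m.age_days / half_life)


def score(m: Memory) -> float:
    recency = 1.0 / (1.0 + m.age_days)  # assumed recency shape
    return (ALPHA * m.cosine + BETA * recency
            + GAMMA * effective_salience(m) + DELTA * TYPE_PRIOR[m.mem_type])


def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a tiktoken-based count


def pack(memories: list[Memory], budget: int) -> list[Memory]:
    """Greedily take the highest-scoring memories that still fit the budget."""
    chosen, used = [], 0
    for m in sorted(memories, key=score, reverse=True):
        cost = count_tokens(m.text)
        if used + cost <= budget:  # this sketch skips items that do not fit
            chosen.append(m)
            used += cost
    return chosen
```

With these assumed values, a pinned preference keeps its full salience at any age, while an episodic note loses half of its salience every seven days, so the preference wins the first slot when the two have equal cosine.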

Architecture

flowchart TB
    U["MCP client / demo UI"]

    subgraph ecs["Alibaba Cloud ECS (Singapore)"]
        API["FastAPI backend<br/>/chat · /health · /usage<br/>/memory/export · /memory/import<br/>/dream · /dream/apply"]
        AGENT["MemoryAgent loop<br/>Qwen function-calling"]
        DREAM["Dreaming loop<br/>propose → approve consolidation"]
        MCP["FastMCP server<br/>remember / recall / forget / stats<br/>export / import / dream / dream_apply"]
        ENG["Memory Engine<br/>write · retrieve · exact + semantic supersession<br/>typed retrieval · decay + reinforce · dreaming loop<br/>token-budget packing"]
        QD[("Qdrant<br/>embedded vector store")]
        SNAP[("Disk snapshot<br/>memory.json · survives restart")]
    end

    DS["Qwen Cloud / DashScope-intl<br/>reasoning model + text-embedding-v3<br/>(usage metered per call)"]

    U -->|HTTP| API
    U -.->|MCP| MCP
    API --> AGENT
    API --> DREAM
    AGENT -->|"decides which tool to call"| ENG
    DREAM -->|"proposes / applies"| ENG
    MCP --> ENG
    AGENT <-->|"chat + tool specs"| DS
    DREAM <-->|"review memories"| DS
    ENG <-->|"embed"| DS
    ENG <--> QD
    ENG <-->|"save on write / load on start"| SNAP

The agent loop (/chat) lets Qwen choose tool calls; the same memory engine is also exposed directly over MCP for any MCP client, and the dreaming loop drives it as a maintenance pass. With MEMORY_PERSIST_PATH set, the engine snapshots to disk on every change and rehydrates on startup, so the store survives a restart. The Qwen client has bounded retry/backoff for resilience and meters token usage on every call.
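
One common way to get an atomic JSON snapshot like the one described above is to write to a temporary file in the same directory and then rename it over the target. The sketch below uses only the standard library. The function names and the list-of-dicts record shape are hypothetical, and it does not rebuild a vector index, which the real engine is said to do on load.

```python
import json
import os
import tempfile


def save_snapshot(records: list[dict], path: str) -> None:
    """Write records to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(records, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path: str) -> list[dict]:
    """Return the saved records, or an empty store on first start."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A server following this pattern would call load_snapshot once at startup and save_snapshot after each write, using the path from MEMORY_PERSIST_PATH.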

HTTP + MCP surface

| HTTP route | MCP tool(s) | Purpose |
|---|---|---|
| POST /chat | memory.remember / recall / forget | agent loop; Qwen picks memory tools |
| GET /usage | | accumulated token usage (per model) |
| GET /memory/export · POST /memory/import | memory.export / memory.import | round-trip the store (JSON + Markdown) |
| POST /dream · POST /dream/apply | memory.dream / memory.dream_apply | propose consolidations, then apply approved ones |
| GET /health | memory.stats | liveness / store counts |
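
The propose-then-apply split behind /dream and /dream/apply comes down to two checks before any mutation: the proposal must be approved, and every record id it names must exist in the live store. The sketch below illustrates that idea. The proposal schema (proposal_id, action, record_ids, salience, merged_text) and the dict-based store are hypothetical and not taken from the project.

```python
ALLOWED_ACTIONS = {"merge", "forget", "resalience"}


def is_valid(proposal: dict, store: dict) -> bool:
    """A proposal is valid only if it names a known action and live record ids."""
    ids = proposal.get("record_ids", [])
    return (proposal.get("action") in ALLOWED_ACTIONS
            and len(ids) > 0
            and all(i in store for i in ids))


def apply_approved(proposals: list[dict], approved: set, store: dict) -> list:
    """Apply approved, valid proposals in order; return the ids that were applied."""
    applied = []
    for p in proposals:
        # Validity is re-checked here because an earlier proposal may have
        # removed a record that a later one refers to.
        if p["proposal_id"] not in approved or not is_valid(p, store):
            continue
        ids = p["record_ids"]
        if p["action"] == "forget":
            for i in ids:
                store.pop(i, None)
        elif p["action"] == "resalience":
            for i in ids:
                store[i]["salience"] = p["salience"]
        else:  # merge: keep the first record, fold the rest into it
            keep, drop = ids[0], ids[1:]
            store[keep]["text"] = p["merged_text"]
            for i in drop:
                if i != keep:
                    store.pop(i, None)
        applied.append(p["proposal_id"])
    return applied
```

A proposal that references an id the model invented is skipped without touching the store, even if a human approved it.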

Stack

Python · FastAPI · Qwen function-calling agent loop · FastMCP · openai SDK → DashScope-intl · Qwen text-embedding-v3 · Qdrant · tiktoken (budget accounting).

Quickstart

uv sync
cp .env.example .env   # set DASHSCOPE_API_KEY + DASHSCOPE_BASE_URL
PYTHONPATH=src uv run --no-sync pytest -q tests/  # fully mocked — zero Qwen credit spend

Benchmark results

Reproducible and fully offline: PYTHONPATH=src uv run --no-sync python -m benchmark.run uses a deterministic bag-of-vocabulary embedder, so the harness measures the memory engine's ranking + supersession logic (not embedding noise) and costs zero Qwen credits. All three systems compete under the same shrinking token budget, so this is a fair context-efficiency test.

Context-efficiency curves

The table reports context recall (retrieval-level, model-free) and staleness rate (fraction of retrieved contexts containing a retired fact; lower is better) against the memory token budget, over the six-persona, 24-query synthetic set in benchmark/generate.py. Token budgets are counted with the tiktoken encoding for gpt-4o-mini, used as a consistent approximation for Qwen context accounting.

| System (context recall / staleness) | 8 tokens | 16 tokens | 32 tokens | 64 tokens |
|---|---|---|---|---|
| B1 full-history | 0.000 / 0.250 | 0.375 / 0.250 | 0.958 / 0.250 | 1.000 / 0.250 |
| B2 naive top-k | 0.875 / 0.125 | 1.000 / 0.250 | 1.000 / 0.250 | 1.000 / 0.250 |
| B3 ours | 1.000 / 0.000 | 1.000 / 0.000 | 1.000 / 0.000 | 1.000 / 0.000 |
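
The two metrics in the table can be computed without calling a model. The sketch below follows the staleness definition given above (fraction of retrieved contexts containing a retired fact). For context recall it assumes a per-query binary check, "did the retrieved context contain the expected current fact", which is a guess at the harness's definition, not something the README confirms. Function names and the list-of-fact-ids data shape are hypothetical.

```python
def context_recall(retrieved: list[list[str]], expected: list[str]) -> float:
    """Fraction of queries whose retrieved context contains the expected fact."""
    hits = sum(1 for ctx, fact in zip(retrieved, expected) if fact in ctx)
    return hits / len(expected)


def staleness_rate(retrieved: list[list[str]], retired: set[str]) -> float:
    """Fraction of retrieved contexts that contain at least one retired fact."""
    stale = sum(1 for ctx in retrieved if any(f in retired for f in ctx))
    return stale / len(retrieved)


# Four toy queries: each inner list is the set of fact ids packed into the context.
contexts = [["a", "b"], ["c"], ["d", "old"], ["e"]]
print(context_recall(contexts, ["a", "x", "d", "e"]))  # 3 of 4 queries hit
print(staleness_rate(contexts, {"old"}))               # 1 of 4 contexts is stale
```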

B3 holds context recall 1.000 and staleness 0.000 at every budget — it's the only system that recalls the current facts and never re-surfaces retired ones. Two things the naive baselines can't do:

  • B1 (dump history chronologically) wastes its budget on the oldest facts, so it needs a large budget just to recall the current answer — and it permanently carries the stale one.
  • B2 (keyword top-k) gets staler as the budget grows: with no notion of "replaced," extra budget pulls retired facts back in, so its staleness climbs 0.125 → 0.250 and then plateaus.

Only supersession-aware forgetting + budget-constrained recall keeps the working set both correct and small.

The semantic supersession threshold is also checked against live DashScope text-embedding-v3 embeddings in docs/embedding-validation.md. That run did not validate the threshold cleanly: supersession-pair cosines were 0.879 to 0.908, while unrelated distractors were 0.683 to 0.743. The default SUPERSEDE_THRESHOLD=0.9 therefore sits well above every distractor but also above the lowest-scoring supersession pairs, so it is conservative: it errs toward missing a true paraphrase rather than retiring an unrelated record. It should be revisited with a larger set rather than treated as a proven universal constant.
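
To make the threshold's effect concrete, here is a minimal sketch of a write path that retires older records either by exact (subject, type) match or by cosine similarity at or above SUPERSEDE_THRESHOLD. The record fields, the toy two-dimensional vectors, and the choice to restrict the semantic pass to records of the same type are assumptions for illustration, not the project's implementation.

```python
import math

SUPERSEDE_THRESHOLD = 0.9  # default named in the README


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def write_with_supersession(records: list[dict], new: dict,
                            threshold: float = SUPERSEDE_THRESHOLD) -> list[str]:
    """Retire active records the new one replaces, then append it; return retired ids."""
    retired = []
    for r in records:
        if not r["active"]:
            continue
        exact = (r["subject"], r["type"]) == (new["subject"], new["type"])
        # Assumption: the semantic pass only compares records of the same type.
        semantic = (r["type"] == new["type"]
                    and cosine(r["vector"], new["vector"]) >= threshold)
        if exact or semantic:
            r["active"] = False
            retired.append(r["id"])
    records.append({**new, "active": True})
    return retired
```

Under this sketch, a true paraphrase whose cosine is 0.879 would not be retired at the default threshold, which is the gap the validation run points at.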

License

MIT — see LICENSE.
