TokenSaver MCP

TokenSaver MCP

An MCP server that reduces AI API costs by up to 97% through token measurement, compression, caching, and pruning, all without changing prompts.

Category
Visit Server

README

<div align="center">

TokenSaver MCP

Cut your AI API costs by up to 97% — without changing a single prompt.

License Python MCP Tests

An MCP (Model Context Protocol) server that gives AI agents ten tools to measure, compress, cache, and prune token usage — so developers on limited plans can do more with less.

</div>


Why TokenSaver?

Every API call sends more tokens than necessary. Conversation history accumulates. Web pages arrive as raw HTML. Tool results get re-fetched on every turn. System prompts bloat over iterations.

TokenSaver intercepts each of these patterns and fixes them at the agent level — no model changes, no prompt engineering, no plan upgrades.

Scenario Before After Saved
10-turn conversation history 40,000 tokens 8,000 tokens 80%
Webpage fetch (raw HTML) 22,000 tokens 1,200 tokens 94%
Bloated system prompt 600 tokens 220 tokens 63%
Repeated tool call (cached) 1,500 tokens 50 tokens 97%

Tools

Tool What it does
count_tokens Measure token cost before sending — decide whether to compress first
compress_context Shrink long text or conversation history with offline LSA summarization
cache_store / cache_get / cache_invalidate Persist tool results to disk with TTL — never run the same lookup twice
extract_webpage Fetch a URL and return only the readable content, not raw HTML
summarize_file Get a structural + content summary of any file or directory
prune_conversation Remove filler turns and compress old messages in conversation history
optimize_prompt Shorten verbose system prompts while preserving constraints
advise_context_window Diagnose token bloat and get targeted recommendations

All tools work fully offline — no API key required for core features.


Installation

git clone https://github.com/pozii/tokensaver.git
cd tokensaver
pip install -e .

Python 3.11+ required. On first use, compress_context will auto-download the NLTK punkt_tab tokenizer (~2 MB) if not already present.


How it connects to your AI client

TokenSaver has no URL and runs no background server by default. It uses stdio transport: the AI client reads your config, spawns python -m tokensaver as a child process, and talks to it through stdin/stdout. You never open a port or start anything manually — the client does it for you when it launches.

Your AI client  ──spawn──▶  python -m tokensaver  ──stdio──▶  tools available

The alternative is SSE transport, where you start the server yourself on a local port and the client connects over HTTP. This is useful for multi-agent setups or when multiple clients share the same server instance.


Setup

Claude Desktop

Config file location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "tokensaver": {
      "command": "python",
      "args": ["-m", "tokensaver"]
    }
  }
}

Save the file and restart Claude Desktop. The tokensaver tools will appear in the tool list.

Claude Code

claude mcp add tokensaver -- python -m tokensaver

Or add manually to ~/.claude/settings.json:

{
  "mcpServers": {
    "tokensaver": {
      "command": "python",
      "args": ["-m", "tokensaver"]
    }
  }
}

OpenCode

Config file: ~/.config/opencode/config.json

{
  "mcp": {
    "servers": {
      "tokensaver": {
        "type": "local",
        "command": ["python", "-m", "tokensaver"]
      }
    }
  }
}

Any MCP-compatible client (SSE mode)

Start the server once:

python -m tokensaver --transport sse --port 8765

Then point your client at:

http://localhost:8765/sse

Fixing Python path issues

If your system has multiple Python versions and python resolves to the wrong one, use the full path:

# Find the right Python
which python3        # macOS / Linux
where python         # Windows

Then use the full path in your config:

{
  "mcpServers": {
    "tokensaver": {
      "command": "/usr/local/bin/python3",
      "args": ["-m", "tokensaver"]
    }
  }
}
{
  "mcpServers": {
    "tokensaver": {
      "command": "C:\\Python314\\python.exe",
      "args": ["-m", "tokensaver"]
    }
  }
}

Usage

Recommended workflow

Each turn:
  1. count_tokens          → How large is my current context?
  2. advise_context_window → Am I approaching the model's limit?

Before expensive tool calls:
  3. cache_get             → Did I already run this?

When fetching web content:
  4. extract_webpage       → Clean text, not raw HTML

When history grows long:
  5. prune_conversation    → Drop filler turns, compress old ones
  6. compress_context      → Shrink large injected context blocks

When writing system prompts:
  7. optimize_prompt       → Remove redundant phrasing

Tool reference

<details> <summary><strong>count_tokens</strong> — measure before you send</summary>

{
  "content": "Some long text or list of messages...",
  "model": "claude-sonnet-4",
  "include_message_overhead": true
}

Returns token_count, encoding_used, model. Accepts a plain string or an OpenAI-format message list.

</details>

<details> <summary><strong>compress_context</strong> — shrink long text</summary>

{
  "text": "3,000-token context block...",
  "target_tokens": 600,
  "mode": "extractive"
}

extractive (default) uses LSA sentence ranking — free, offline, no API call.
abstractive uses claude-haiku for higher quality — requires ANTHROPIC_API_KEY.

Returns compressed, original_tokens, compressed_tokens, reduction_pct.

</details>

<details> <summary><strong>cache_store / cache_get / cache_invalidate</strong> — skip repeated work</summary>

# Standard pattern: check before running
key = cache_key("extract_webpage", {"url": "https://example.com"})
hit = cache_get(key=key)

if not hit["hit"]:
    result = extract_webpage(url="https://example.com")
    cache_store(key=key, value=str(result), ttl_seconds=3600)

Cache is stored on disk at ~/.tokensaver/cache/ and survives server restarts.

</details>

<details> <summary><strong>extract_webpage</strong> — content, not markup</summary>

{
  "url": "https://example.com/article",
  "max_tokens": 2000,
  "include_links": false,
  "include_metadata": true
}

Uses trafilatura with BeautifulSoup as fallback. Returns content, title, token_count, truncated.

</details>

<details> <summary><strong>summarize_file</strong> — understand code without reading it all</summary>

{
  "path": "/home/user/myproject",
  "mode": "both",
  "max_tokens": 500,
  "file_extensions": [".py", ".md"],
  "max_depth": 3
}

mode options: "structure" (tree only), "content" (summarized text), "both".

</details>

<details> <summary><strong>prune_conversation</strong> — clean up history</summary>

{
  "messages": [...],
  "max_output_tokens": 2000,
  "keep_last_n": 4,
  "prune_strategy": "hybrid"
}

"remove" drops filler turns ("Sure!", "Got it.").
"compress" summarizes older turns in place.
"hybrid" does both — recommended for most cases.

Returns the pruned messages list, original_tokens, pruned_tokens, counts of removed/compressed turns.

</details>

<details> <summary><strong>optimize_prompt</strong> — shorter system prompts</summary>

{
  "prompt": "Please make sure to always answer questions...",
  "optimization_level": "medium",
  "preserve_constraints": true,
  "output_format": "prose"
}

"light" removes filler phrases. "medium" deduplicates sentences. "aggressive" restructures.
preserve_constraints: true always keeps sentences containing never, must, always, do not.

</details>

<details> <summary><strong>advise_context_window</strong> — know what to fix</summary>

{
  "model": "gpt-4o",
  "current_tokens": 110000,
  "messages": [...],
  "target_utilization": 0.75
}

Returns status ("ok" / "warning" / "critical"), headroom_tokens, prioritized recommendations, and a per-turn breakdown sorted by token cost.

Supports: GPT-4o, GPT-4o-mini, Claude 3–4 series, Gemini 1.5/2.0/2.5, O1/O3, Llama 3, Mistral.

</details>


Optional: LLM-backed summarization

For higher-quality abstractive compression on very large texts (>5,000 tokens):

pip install "tokensaver-mcp[llm]"

Set ANTHROPIC_API_KEY in your environment or a .env file, then use mode: "abstractive" in compress_context.


Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -v

38 tests — all offline, no API key or network required.


Project Structure

src/tokensaver/
  server.py          # FastMCP app, tool registration
  models.py          # Context window table, shared types
  tools/
    counter.py       # count_tokens
    compress.py      # compress_context
    cache.py         # cache_store / cache_get / cache_invalidate
    extractor.py     # extract_webpage
    summarizer.py    # summarize_file
    pruner.py        # prune_conversation
    optimizer.py     # optimize_prompt
    advisor.py       # advise_context_window
  utils/
    token_utils.py   # tiktoken wrapper
    text_utils.py    # sentence splitting, deduplication

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured