tool-smith

MCP server that routes natural language requests to structured tool calls using a LoRA-tuned small language model, with built-in validation, retry, and fallback recovery.

LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.

The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.

Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.


The result (real, from python -m toolsmith.eval)

Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:

| metric (all 97) | base | LoRA-tuned | Δ (pp) |
| --- | --- | --- | --- |
| valid JSON | 95.9% | 100.0% | +4.1 |
| schema-valid call | 19.6% | 94.8% | +75.2 |
| correct tool | 36.1% | 85.6% | +49.5 |
| exact args | 4.1% | 74.2% | +70.1 |
| fully correct | 4.1% | 74.2% | +70.1 |

On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.

Figure: base vs tuned (eval chart)

The base 0.5B model already knows JSON syntax (95.9% valid) but rarely follows the tool schema (19.6% schema-valid, 4.1% exact args). LoRA SFT teaches it the schema without degrading the syntax it already had.
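
The five metrics in the table can be computed by code alone. The following is a minimal sketch of such a grader, not the project's grade.py: the two-tool `TOOLS` table, the `grade` and `summarize` names, and the rule that "schema-valid" means "known tool with all required args present" are assumptions made for illustration.

```python
import json

# Hypothetical two-tool schema; the project's real toolbox has 8 tools.
TOOLS = {
    "get_weather": {"required": {"city"}},
    "send_email": {"required": {"to", "subject"}},
}


def grade(raw: str, gold: dict) -> dict:
    """Score one raw model output against the gold tool call."""
    result = {
        "valid_json": False,
        "schema_valid": False,
        "correct_tool": False,
        "exact_args": False,
        "fully_correct": False,
    }
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # nothing else can be checked
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result

    tool, args = call.get("tool"), call.get("args")
    spec = TOOLS.get(tool) if isinstance(tool, str) else None
    # Schema check (assumed rule): known tool, object args, required keys present.
    result["schema_valid"] = (
        spec is not None and isinstance(args, dict) and spec["required"] <= set(args)
    )
    result["correct_tool"] = tool == gold["tool"]
    result["exact_args"] = args == gold["args"]
    result["fully_correct"] = (
        result["schema_valid"] and result["correct_tool"] and result["exact_args"]
    )
    return result


def summarize(rows: list) -> dict:
    """Turn per-case flags into percentages, one per metric."""
    return {k: round(100 * sum(r[k] for r in rows) / len(rows), 1) for k in rows[0]}
```

Because "correct tool" is checked independently of the schema check, a model can pick the right tool and still fail on argument names, which is the pattern the base column shows.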

Generalization to hand-written (non-templated) inputs

The training/test data is templated, so the obvious question is "does it generalize beyond the templates?" data/real_test.jsonl is 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…") — never seen in any template. Run python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real:

| metric (12 hand-written) | base | tuned | Δ (pp) |
| --- | --- | --- | --- |
| schema-valid | 25.0% | 100.0% | +75.0 |
| correct tool | 41.7% | 83.3% | +41.6 |
| fully correct | 8.3% | 58.3% | +50.0 |

The lift holds on out-of-distribution phrasing: tool selection and schema adherence generalize strongly. fully_correct (58.3%) is lower than the templated 74.2% because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That 15.9-point gap is the real generalization cost, reported rather than hidden.

The training run (real loss curve)

LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
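
The ~8.8M figure can be sanity-checked with arithmetic. The sketch below assumes the usual Qwen2.5-0.5B shapes (hidden size 896, MLP size 4864, 24 layers, 2 key/value heads of dimension 64) and a base size of roughly 494M parameters. None of these numbers are stated in this README, so treat them as assumptions.

```python
# Assumed Qwen2.5-0.5B shapes (not stated in this README).
HIDDEN, MLP, LAYERS = 896, 4864, 24
KV_DIM = 2 * 64  # 2 key/value heads of dimension 64 (assumption)
RANK = 16

# (in_features, out_features) of each adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per projection: rank * (in + out)."""
    per_layer = sum(rank * (i + o) for i, o in PROJECTIONS.values())
    return per_layer * LAYERS


trainable = lora_params(RANK)
BASE_PARAMS = 494_000_000  # approximate base model size (assumption)
share = 100 * trainable / (BASE_PARAMS + trainable)
print(f"{trainable:,} trainable params, {share:.2f}% of all params")
# -> 8,798,208 trainable params, 1.75% of all params
```

Under these assumptions the count matches the reported numbers when the percentage is taken over base plus adapter parameters.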

Figure: training loss

Quickstart

pip install -e .                                   # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt              # torch/transformers/peft/trl/... for training

python -m toolsmith.data.build      # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train           # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval            # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo    # offline recovery demo -> logs/run-demo.jsonl

Serve the tuned model over MCP

python -m toolsmith.mcp_server      # stdio; exposes route_to_tool(request) + the 8 tools
# or containerized (installs the inference stack, pulls the base model on first run):
docker build -t tool-smith . && docker run --rm -i tool-smith

mcp.json for Claude Desktop / Cursor:

{
  "mcpServers": {
    "tool-smith": {
      "command": "python",
      "args": ["-m", "toolsmith.mcp_server"],
      "cwd": "/path/to/tool-smith"
    }
  }
}

route_to_tool("What's the weather in Tokyo?") returns {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.

The agent loop (validation + recovery + observability)

agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:

ok=True recovery=none              | What's the weather in Tokyo?        (tuned, 1 attempt)
ok=True recovery=repair_retry      | Pack for Berlin? ... rain there.    (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds...     (base failed x3 -> fallback router)
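
The three recovery paths above can be sketched with stub routers. This is an illustrative loop under stated assumptions, not the code in agent.py: the one-tool `REQUIRED` schema, the `run` signature, the limit of three attempts, and the log field names are all hypothetical.

```python
import json
import time
from typing import Callable, Optional

REQUIRED = {"get_weather": {"city"}}  # hypothetical one-tool schema

# A router takes (request, previous validation error) and returns raw model text.
Router = Callable[[str, Optional[str]], str]


def validate(raw: str) -> Optional[str]:
    """Return None for a schema-valid call, otherwise an error message."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in REQUIRED:
        return "unknown or missing tool"
    args = call.get("args")
    if not isinstance(args, dict):
        return "args must be an object"
    missing = REQUIRED[tool] - set(args)
    return f"missing args: {sorted(missing)}" if missing else None


def run(request: str, router: Router, fallback: Router,
        max_attempts: int = 3, log: Optional[list] = None) -> dict:
    """Route, validate, repair-retry with the error fed back, then fall back."""
    error = None
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        raw = router(request, error)  # error is None on the first attempt
        error = validate(raw)
        if log is not None:
            log.append({"attempt": attempt, "raw": raw, "error": error,
                        "latency_s": time.perf_counter() - start})
        if error is None:
            recovery = "none" if attempt == 1 else "repair_retry"
            return {"ok": True, "recovery": recovery, "call": json.loads(raw)}
    # Primary router exhausted its attempts: hand the request to the fallback.
    raw = fallback(request, error)
    error = validate(raw)
    return {"ok": error is None, "recovery": "frontier_fallback",
            "call": json.loads(raw) if error is None else None}
```

Passing routers in as plain callables is what makes this kind of loop testable without a model: a stub that fails once and then succeeds exercises the repair-retry path, and one that always fails exercises the fallback.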

python -m toolsmith.logs_report reports the success rate, recovery breakdown, and latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it is verified without loading a model.
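
A summary of this kind is a small aggregation over JSONL records. The sketch below assumes each record carries `ok`, `recovery`, and `latency_s` fields. Those field names are guesses for illustration, and the actual log schema may differ.

```python
import json
from collections import Counter
from statistics import mean


def summarize(lines) -> dict:
    """Aggregate JSONL run records (assumed fields: ok, recovery, latency_s)."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"runs": 0, "success_rate": None, "recovery": {}, "mean_latency_s": None}
    return {
        "runs": len(records),
        "success_rate": sum(bool(r["ok"]) for r in records) / len(records),
        "recovery": dict(Counter(r["recovery"] for r in records)),
        "mean_latency_s": mean(r["latency_s"] for r in records),
    }
```

It accepts any iterable of lines, so an open file handle can be passed directly, e.g. `summarize(open(path))` for one run log.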

Layout

toolsmith/
  schema.py        # the fixed 8-tool toolbox + JSON validator (one source of truth)
  data/build.py    # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
  train.py         # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
  router.py        # load base (+adapter) and turn a request into a tool-call string
  eval.py          # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
  grade.py         # exact tool/args grading (no LLM judge needed for routing)
  mcp_server.py    # FastMCP: route_to_tool + 8 mock tools
  agent.py         # validate / repair-retry / frontier-fallback loop + JSONL logging
  logs_report.py   # summarize agent runs
data/              # committed train/test jsonl
artifacts/         # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/              # committed real agent traces
tests/             # pytest (grading + agent recovery), model-free
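
The data/build.py entry describes a deterministic dataset with disjoint train and test templates. A minimal sketch of that idea follows. The templates, the city list, and the `build` function are hypothetical and much smaller than the real build script.

```python
import json
import random

# Hypothetical templates and slot values; the real build script defines its own.
TRAIN_TEMPLATES = [
    "What's the weather in {city}?",
    "Weather report for {city}, please.",
]
TEST_TEMPLATES = [
    "Is it raining in {city} right now?",
    "Tell me how {city} looks outside today.",
]
CITIES = ["Tokyo", "Berlin", "Athens", "Lima"]


def build(templates, n, seed):
    """Generate n (request, gold call) rows; output depends only on the seed."""
    rng = random.Random(seed)  # local RNG, so global random state is irrelevant
    rows = []
    for _ in range(n):
        city = rng.choice(CITIES)
        rows.append({
            "request": rng.choice(templates).format(city=city),
            "gold": {"tool": "get_weather", "args": {"city": city}},
        })
    return rows


# Disjoint templates mean no test phrasing was seen during training.
assert not set(TRAIN_TEMPLATES) & set(TEST_TEMPLATES)

train = build(TRAIN_TEMPLATES, 6, seed=0)
test = build(TEST_TEMPLATES, 3, seed=1)
print(json.dumps(train[0]))
```

Slot values such as city names may still overlap between splits in this sketch. Only the phrasing is held out.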

Limitations & next steps

Stated plainly — knowing the limits is part of the work:

  • Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.
  • Synthetic data: the 243 train / 97 test cases are templated (though TRAIN/TEST templates are disjoint, there is a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; free-form fully_correct is 58.3% vs 74.2% templated.
  • Mock tools: the 8 tool bodies are stubs — the contribution is the routing model + eval + MCP serving + agent loop, not the tools.
  • Single base model, no judge in the headline metric (routing has checkable ground truth, so it's code-graded; the optional LLM-judge column needs a key).
  • Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.
  • Every number here comes from an actual local run, regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.

License

MIT
