tool-smith
MCP server that routes natural language requests to structured tool calls using a LoRA-tuned small language model, with built-in validation, retry, and fallback recovery.
LoRA-fine-tune a small model into a JSON tool-call router, serve it over MCP, and prove the lift with a from-scratch base-vs-tuned eval — plus an observable agent loop with failure recovery.
The arc, end to end: build the dataset → LoRA-SFT a 0.5B model → measure it against base on a hard held-out split → serve the tuned model over MCP so any agent can call it → wrap it in an agent loop that validates, repair-retries, and falls back. Everything here actually ran on a 16 GB Apple-Silicon Mac; the loss curve, the adapter, and the eval numbers are committed real outputs, not placeholders.
Teaching-scale on purpose. The point isn't to claim I trained a frontier model — it's to demonstrate that I can stand up the full PyTorch + PEFT + TRL training loop, build the data, read the loss curve, and prove an improvement with a rigorous eval.
The result (real, from python -m toolsmith.eval)
Qwen2.5-0.5B-Instruct, held-out test set of 97 cases (54 easy + 43 hard), graded by code against the exact gold tool + args:
| metric (all 97) | base | LoRA-tuned | Δ |
|---|---|---|---|
| valid JSON | 95.9% | 100.0% | +4.1 |
| schema-valid call | 19.6% | 94.8% | +75.2 |
| correct tool | 36.1% | 85.6% | +49.5 |
| exact args | 4.1% | 74.2% | +70.1 |
| fully correct | 4.1% | 74.2% | +70.1 |
On the hard split (ambiguous wording, near-duplicate tools, distractors) the base model gets 0% fully correct; the tuned model gets 76.7%.

The base 0.5B model already emits valid JSON (95.9%) but rarely follows the tool schema (19.6% schema-valid, 4.1% exact args). LoRA SFT mostly teaches the schema; JSON syntax was largely in place already.
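The five metrics in the table can each be checked by code alone, which is why no LLM judge is needed. The following is a minimal sketch of such a grader, not the contents of grade.py: the two-tool `TOOLS` schema, the `grade` function, and the `tool`/`args` field layout are assumptions for illustration.

```python
import json

# Hypothetical schema for illustration: tool name -> required argument names.
TOOLS = {
    "get_weather": {"city"},
    "send_email": {"to", "subject"},
}

METRICS = ["valid_json", "schema_valid", "correct_tool", "exact_args", "fully_correct"]


def grade(raw_output: str, gold: dict) -> dict:
    """Grade one raw model output against a gold call {"tool": ..., "args": {...}}."""
    result = dict.fromkeys(METRICS, False)
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return result  # not even parseable, so every metric stays False
    result["valid_json"] = True
    if not isinstance(call, dict):
        return result
    tool, args = call.get("tool"), call.get("args")
    # Schema-valid: a known tool whose argument names match the schema exactly.
    result["schema_valid"] = (
        isinstance(tool, str)
        and tool in TOOLS
        and isinstance(args, dict)
        and set(args) == TOOLS[tool]
    )
    result["correct_tool"] = tool == gold["tool"]
    result["exact_args"] = args == gold["args"]
    result["fully_correct"] = result["correct_tool"] and result["exact_args"]
    return result
```

Averaging each key over the test set would give one column of the table. A wrong tool name with well-formed JSON counts as valid JSON but fails every later check, which is the pattern the base model shows.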
Generalization to hand-written (non-templated) inputs
The training/test data is templated, so the obvious question is "does it generalize beyond the templates?" data/real_test.jsonl is 12 hand-written, naturalistic requests (e.g. "is it shorts weather in Athens right now or should I bring a jacket", "shoot Priya a message, subject 'Q3 numbers'…") — never seen in any template. Run python -m toolsmith.eval --testfile data/real_test.jsonl --tag _real:
| metric (12 hand-written) | base | tuned | Δ |
|---|---|---|---|
| schema-valid | 25.0% | 100.0% | +75.0 |
| correct tool | 41.7% | 83.3% | +41.6 |
| fully correct | 8.3% | 58.3% | +50.0 |
The lift holds on genuinely out-of-distribution phrasing — tool selection and schema adherence generalize strongly; fully_correct (58.3%) is honestly lower than the templated 74.2%, because exact-arg matching on free-form text (e.g. "next Thursday" → a date string) is harder. That gap is the real generalization cost, reported rather than hidden.
The training run (real loss curve)
LoRA rank 16 on attention+MLP projections (~8.8M trainable params, 1.75% of the model), 3 epochs, ~6.5 min on MPS. train_loss 4.3 → 0.35.
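The ~8.8M and 1.75% figures can be sanity-checked by hand. The sketch below assumes the layer dimensions and base parameter count of the public Qwen2.5-0.5B config (they are not stated in this README), and assumes the adapter covers all seven linear projections in every block. The percentage is taken over base plus adapter parameters, which is an assumption about how the figure was reported.

```python
# Assumed dimensions from the public Qwen2.5-0.5B config (not from this repo).
HIDDEN = 896          # hidden_size
KV_DIM = 128          # num_key_value_heads (2) * head_dim (64)
INTERMEDIATE = 4864   # intermediate_size
LAYERS = 24           # num_hidden_layers
RANK = 16             # LoRA rank stated above
BASE_PARAMS = 494_032_768  # assumed parameter count of the base model

# (in_features, out_features) of each adapted linear layer in one block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices per layer: A (rank x in) and B (out x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * LAYERS
share = 100 * trainable / (BASE_PARAMS + trainable)
adapter_mb = trainable * 4 / 1e6  # assuming float32 weights on disk

print(f"{trainable:,} trainable params, {share:.2f}% of all params, ~{adapter_mb:.1f} MB")
```

Under these assumptions the count comes to 8,798,208 parameters, which also lines up with an adapter of roughly 35 MB in float32.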

Quickstart
pip install -e . # MCP server + agent + grader (light deps)
pip install -r requirements-train.txt # torch/transformers/peft/trl/... for training
python -m toolsmith.data.build # -> data/train.jsonl (243), data/test.jsonl (97)
python -m toolsmith.train # LoRA SFT -> artifacts/adapter + artifacts/loss.png
python -m toolsmith.eval # base vs tuned -> artifacts/eval_report.md + eval_chart.png
python -m toolsmith.agent --demo # offline recovery demo -> logs/run-demo.jsonl
Serve the tuned model over MCP
python -m toolsmith.mcp_server # stdio; exposes route_to_tool(request) + the 8 tools
# or containerized (installs the inference stack, pulls the base model on first run):
docker build -t tool-smith . && docker run --rm -i tool-smith
mcp.json for Claude Desktop / Cursor:
{
"mcpServers": {
"tool-smith": {
"command": "python",
"args": ["-m", "toolsmith.mcp_server"],
"cwd": "/path/to/tool-smith"
}
}
}
route_to_tool("What's the weather in Tokyo?") → {"tool": "get_weather", "args": {"city": "Tokyo"}, "valid": true, ...}.
The agent loop (validation + recovery + observability)
agent.py wraps the router: route → parse → validate against the tool schema → on failure, repair-retry with the error fed back → if still failing, fall back to a frontier/rule router → execute. Every step is appended to logs/run-*.jsonl (raw output, latency, validation verdict, retry count, recovery action). A real model-backed run (logs/run-model.jsonl) exercises all three paths:
ok=True recovery=none | What's the weather in Tokyo? (tuned, 1 attempt)
ok=True recovery=repair_retry | Pack for Berlin? ... rain there. (base model failed, retry fixed it)
ok=True recovery=frontier_fallback | ...what's sitting in refunds... (base failed x3 -> fallback router)
python -m toolsmith.logs_report → success rate, recovery breakdown, latency. The recovery logic is unit-tested with stub routers (tests/test_agent.py), so it's verified without a model.
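The route → validate → repair-retry → fallback flow described above can be sketched with plain callables standing in for the models. This is an illustrative outline, not the code in agent.py: the `run` and `validate` names, the one-tool `REQUIRED_ARGS` schema, the repair prompt wording, and the limit of two retries are all assumptions, and JSONL logging is left out.

```python
import json
from typing import Callable, Optional, Tuple

Router = Callable[[str], str]  # prompt -> raw model output

# Hypothetical one-tool schema for illustration.
REQUIRED_ARGS = {"get_weather": {"city"}}


def validate(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (call, None) if the output is a valid tool call, else (None, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "output is not a JSON object"
    tool, args = call.get("tool"), call.get("args")
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        return None, "unknown or missing tool"
    if not isinstance(args, dict) or set(args) != REQUIRED_ARGS[tool]:
        return None, f"args must be exactly {sorted(REQUIRED_ARGS[tool])}"
    return call, None


def run(request: str, router: Router, fallback: Router, max_retries: int = 2) -> dict:
    prompt = request
    for attempt in range(1 + max_retries):
        call, error = validate(router(prompt))
        if call is not None:
            recovery = "none" if attempt == 0 else "repair_retry"
            return {"ok": True, "recovery": recovery, "attempts": attempt + 1, "call": call}
        # Feed the validation error back so the next attempt can repair it.
        prompt = f"{request}\nPrevious output was rejected: {error}. Return only a valid tool call."
    # Primary router exhausted its attempts: hand the original request to the fallback.
    call, _ = validate(fallback(request))
    return {"ok": call is not None, "recovery": "frontier_fallback",
            "attempts": 1 + max_retries, "call": call}
```

Stub routers that return fixed strings are enough to drive all three paths, which is the same idea as testing the recovery logic without a model.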
Layout
toolsmith/
schema.py # the fixed 8-tool toolbox + JSON validator (one source of truth)
data/build.py # deterministic dataset; TRAIN/TEST templates are DISJOINT + a hard split
train.py # PEFT LoRA SFT via TRL SFTTrainer (MPS), saves adapter + loss.png
router.py # load base (+adapter) and turn a request into a tool-call string
eval.py # base vs tuned, code-graded per bucket -> report.md + chart.png + csv
grade.py # exact tool/args grading (no LLM judge needed for routing)
mcp_server.py # FastMCP: route_to_tool + 8 mock tools
agent.py # validate / repair-retry / frontier-fallback loop + JSONL logging
logs_report.py # summarize agent runs
data/ # committed train/test jsonl
artifacts/ # committed: adapter/, loss.png, eval_report.md, eval_chart.png, eval.csv
logs/ # committed real agent traces
tests/ # pytest (grading + agent recovery), model-free
Limitations & next steps
Stated plainly — knowing the limits is part of the work:
- Teaching-scale: 0.5B model · LoRA (PEFT) · SFT-only — an adapter (~35 MB), not a full or from-scratch fine-tune, not algorithm research, not large-scale/distributed training.
- Synthetic data: 243 train / 97 test are templated (though TRAIN/TEST templates are disjoint + a hard split, and the 12 hand-written cases above show real generalization). Real user traffic is messier; the honest free-form fully_correct is 58% vs 74% templated.
- Mock tools: the 8 tool bodies are stubs; the contribution is the routing model + eval + MCP serving + agent loop, not the tools.
- Single base model, no judge in the headline metric (routing has checkable ground truth, so it's code-graded; the optional LLM-judge column needs a key).
- Next steps if taken further: train on real (de-identified) request logs, add tool-arg-type coercion in the agent, compare LoRA ranks / a 1.5B base, add function-calling-format export (OpenAI/Anthropic tool schemas), and a serving latency benchmark.
- Every number here comes from an actual local run, regenerable (fixed seed; requirements-train.txt pins the exact stack). No placeholder figures.
License
MIT