Unsloth MCP Server
An MCP server for Unsloth - a library that makes LLM fine-tuning 2x faster with 80% less memory.
What is Unsloth?
Unsloth is a library that dramatically improves the efficiency of fine-tuning large language models:
- Speed: 2x faster fine-tuning compared to standard methods
- Memory: 80% less VRAM usage, allowing fine-tuning of larger models on consumer GPUs
- Context Length: Up to 13x longer context lengths (e.g., 89K tokens for Llama 3.3 on 80GB GPUs)
- Accuracy: No loss in model quality or performance
Unsloth achieves these improvements through custom CUDA kernels written in OpenAI's Triton language, optimized backpropagation, and dynamic 4-bit quantization.
Features
- Optimize fine-tuning for Llama, Mistral, Phi, Gemma, and other models
- 4-bit quantization for efficient training
- Extended context length support
- Simple API for model loading, fine-tuning, and inference
- Export to various formats (GGUF, Hugging Face, etc.)
Quick Start
- Install Unsloth:
pip install unsloth
- Install and build the server:
cd unsloth-server
npm install
npm run build
- Add to MCP settings:
{ "mcpServers": { "unsloth-server": { "command": "node", "args": ["/path/to/unsloth-server/build/index.js"], "env": { "HUGGINGFACE_TOKEN": "your_token_here" // Optional }, "disabled": false, "autoApprove": [] } } }
Available Tools
check_installation
Verify if Unsloth is properly installed on your system.
Parameters: None
Example:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "check_installation",
arguments: {}
});
list_supported_models
Get a list of all models supported by Unsloth, including Llama, Mistral, Phi, and Gemma variants.
Parameters: None
Example:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "list_supported_models",
arguments: {}
});
load_model
Load a pretrained model with Unsloth optimizations for faster inference and fine-tuning.
Parameters:
- model_name (required): Name of the model to load (e.g., "unsloth/Llama-3.2-1B")
- max_seq_length (optional): Maximum sequence length for the model (default: 2048)
- load_in_4bit (optional): Whether to load the model in 4-bit quantization (default: true)
- use_gradient_checkpointing (optional): Whether to use gradient checkpointing to save memory (default: true)
Example:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "load_model",
arguments: {
model_name: "unsloth/Llama-3.2-1B",
max_seq_length: 4096,
load_in_4bit: true
}
});
finetune_model
Fine-tune a model with Unsloth optimizations using LoRA/QLoRA techniques.
Parameters:
- model_name (required): Name of the model to fine-tune
- dataset_name (required): Name of the dataset to use for fine-tuning
- output_dir (required): Directory to save the fine-tuned model
- max_seq_length (optional): Maximum sequence length for training (default: 2048)
- lora_rank (optional): Rank for LoRA fine-tuning (default: 16)
- lora_alpha (optional): Alpha for LoRA fine-tuning (default: 16)
- batch_size (optional): Batch size for training (default: 2)
- gradient_accumulation_steps (optional): Number of gradient accumulation steps (default: 4)
- learning_rate (optional): Learning rate for training (default: 2e-4)
- max_steps (optional): Maximum number of training steps (default: 100)
- dataset_text_field (optional): Field in the dataset containing the text (default: 'text')
- load_in_4bit (optional): Whether to use 4-bit quantization (default: true)
Example:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "finetune_model",
arguments: {
model_name: "unsloth/Llama-3.2-1B",
dataset_name: "tatsu-lab/alpaca",
output_dir: "./fine-tuned-model",
max_steps: 100,
batch_size: 2,
learning_rate: 2e-4
}
});
generate_text
Generate text using a fine-tuned Unsloth model.
Parameters:
- model_path (required): Path to the fine-tuned model
- prompt (required): Prompt for text generation
- max_new_tokens (optional): Maximum number of tokens to generate (default: 256)
- temperature (optional): Temperature for text generation (default: 0.7)
- top_p (optional): Top-p for text generation (default: 0.9)
Example:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "generate_text",
arguments: {
model_path: "./fine-tuned-model",
prompt: "Write a short story about a robot learning to paint:",
max_new_tokens: 512,
temperature: 0.8
}
});
export_model
Export a fine-tuned Unsloth model to various formats for deployment.
Parameters:
- model_path (required): Path to the fine-tuned model
- export_format (required): Format to export to (gguf, ollama, vllm, huggingface)
- output_path (required): Path to save the exported model
- quantization_bits (optional): Bits for quantization (for GGUF export) (default: 4)
Example:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "export_model",
arguments: {
model_path: "./fine-tuned-model",
export_format: "gguf",
output_path: "./exported-model.gguf",
quantization_bits: 4
}
});
Advanced Usage
Custom Datasets
You can use custom datasets by formatting them appropriately and either hosting them on Hugging Face or providing a local path:
const result = await use_mcp_tool({
server_name: "unsloth-server",
tool_name: "finetune_model",
arguments: {
model_name: "unsloth/Llama-3.2-1B",
dataset_name: "json",
data_files: {"train": "path/to/your/data.json"},
output_dir: "./fine-tuned-model"
}
});
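When passing a local file via data_files, the JSON should contain one record per training example, with the text stored under the field named by dataset_text_field (default: 'text'). Below is a minimal sketch of what data.json might look like; the field name assumes the default, and the instruction/response layout is just one possible prompt format, not a requirement:

[
  { "text": "### Instruction: Summarize the paragraph below.\n### Response: ..." },
  { "text": "### Instruction: Translate the sentence to French.\n### Response: ..." }
]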
Memory Optimization
For large models on limited hardware, combine the following settings (an example call follows the list):
- Reduce batch size and increase gradient accumulation steps
- Use 4-bit quantization
- Enable gradient checkpointing
- Reduce sequence length if possible
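For example, here is a minimal sketch of a finetune_model call that combines these settings; the model and dataset names are placeholders, and the specific values are illustrative rather than tuned recommendations:

const result = await use_mcp_tool({
  server_name: "unsloth-server",
  tool_name: "finetune_model",
  arguments: {
    model_name: "unsloth/Llama-3.2-1B",   // pick the smallest model that fits your task
    dataset_name: "tatsu-lab/alpaca",
    output_dir: "./fine-tuned-model",
    max_seq_length: 1024,                 // shorter sequences reduce activation memory
    batch_size: 1,                        // smaller per-step batches...
    gradient_accumulation_steps: 8,       // ...offset by more accumulation steps
    load_in_4bit: true                    // 4-bit quantization for the base weights
  }
});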
Troubleshooting
Common Issues
- CUDA Out of Memory: Reduce batch size, use 4-bit quantization, or try a smaller model
- Import Errors: Ensure you have the correct versions of torch, transformers, and unsloth installed
- Model Not Found: Check that you're using a supported model name or have access to private models
Version Compatibility
- Python: 3.10, 3.11, or 3.12 (not 3.13)
- CUDA: 11.8 or 12.1+ recommended
- PyTorch: 2.0+ recommended
Performance Benchmarks
| Model | GPU VRAM | Unsloth Speed | VRAM Reduction | Context Length |
|---|---|---|---|---|
| Llama 3.3 (70B) | 80GB | 2x faster | >75% | 13x longer |
| Llama 3.1 (8B) | 80GB | 2x faster | >70% | 12x longer |
| Mistral v0.3 (7B) | 80GB | 2.2x faster | 75% less | - |
Requirements
- Python 3.10-3.12
- NVIDIA GPU with CUDA support (recommended)
- Node.js and npm
License
Apache-2.0