vibevoice-asr
Local speech-to-text transcription using Microsoft's VibeVoice-ASR model with speaker diarization, enabling audio transcription directly in AI tools like Claude Code, Cursor, and OpenCode.
VibeVoice-ASR Server
Local speech-to-text using Microsoft's VibeVoice-ASR model. Run it as an OpenAI-compatible API server or as an MCP server that plugs directly into Claude Code, OpenCode, Cursor, and other AI tools.
- Automatic speaker diarization
- Timestamps on every segment
- Output as plain text, JSON, SRT, or VTT
- Runs on CUDA, Apple Silicon (MPS), or CPU
- Model downloads automatically on first run
Requirements
- Python 3.10+
- FFmpeg (used by the model's audio processor)
Install FFmpeg:
# macOS
brew install ffmpeg
# Ubuntu / Debian
sudo apt-get install ffmpeg
# Windows (with Chocolatey)
choco install ffmpeg
Quick Start
# Clone the repo
git clone https://github.com/tjameswilliams/vibevoice-server.git
cd vibevoice-server
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install
pip install -e .
# For NVIDIA GPU acceleration (optional)
pip install -e ".[cuda]"
The first time you run either the API server or MCP server, the model (~3 GB) will be downloaded from HuggingFace and cached locally.
Option 1: OpenAI-Compatible API Server
Start the server:
vibevoice-server
The server starts on http://localhost:8000 by default. It exposes the same endpoint shape as the OpenAI Audio API, so any client library or tool that speaks that protocol works out of the box.
CLI Options
vibevoice-server [OPTIONS]
  --host        Bind address (default: 0.0.0.0)
  --port        Bind port (default: 8000)
  --device      Device: auto, cuda, mps, cpu (default: auto)
  --dtype       Data type: auto, bfloat16, float32 (default: auto)
  --log-level   Log level: debug, info, warning, error (default: info)
Transcribe Audio
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@meeting.wav \
-F response_format=verbose_json
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (wav, mp3, flac, m4a, ogg, etc.) |
| model | string | vibevoice-asr | Model identifier (accepted but ignored) |
| response_format | string | json | text, json, verbose_json, srt, vtt |
| prompt | string | | Optional context to guide transcription |
| language | string | | Language code (used in verbose_json output) |
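For environments without curl or the OpenAI SDK, the same multipart upload can be sketched with only the Python standard library. This is an illustrative sketch, not code from the project: the helper names `build_multipart` and `transcribe` are hypothetical, and it assumes the server is listening on the default http://localhost:8000 and that a JSON response format is requested.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file part named "file" as multipart/form-data.

    Returns (body_bytes, content_type_header). Hypothetical helper for illustration.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The audio itself goes in the "file" field, matching the parameter table above.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:8000", response_format="verbose_json"):
    """POST an audio file to /v1/audio/transcriptions and decode the JSON reply.

    Only meaningful for the json and verbose_json formats; text, srt, and vtt
    are not JSON and would need resp.read().decode() instead.
    """
    with open(path, "rb") as f:
        body, content_type = build_multipart(
            {"response_format": response_format}, os.path.basename(path), f.read()
        )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `transcribe("meeting.wav")["text"]` would be expected to hold the transcript string.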
Response Formats
json (default):
{"text": "Hello, welcome to the meeting."}
verbose_json — includes timestamps, speaker IDs, and segments:
{
  "task": "transcribe",
  "language": "en",
  "duration": 12.5,
  "text": "Hello, welcome to the meeting.",
  "segments": [
    {"id": 0, "start": 0.0, "end": 3.2, "text": "Hello, welcome to the meeting.", "speaker": 0}
  ]
}
srt and vtt — subtitle formats with speaker labels, ready to use with video players.
text — plain transcript string, no JSON wrapper.
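As a small illustration of working with the verbose_json shape shown above, the following sketch turns the segments into speaker-labeled lines. The function names are hypothetical, and the sample payload is the example response from this section rather than real model output.

```python
import json


def format_timestamp(seconds):
    """Render a second count as MM:SS, dropping the fractional part."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def speaker_lines(response):
    """Build one '[start-end] Speaker N: text' line per segment."""
    lines = []
    for seg in response.get("segments", []):
        span = f"{format_timestamp(seg['start'])}-{format_timestamp(seg['end'])}"
        lines.append(f"[{span}] Speaker {seg['speaker']}: {seg['text']}")
    return lines


# The verbose_json example from this README, used as sample input.
sample = json.loads(
    """
    {
      "task": "transcribe",
      "language": "en",
      "duration": 12.5,
      "text": "Hello, welcome to the meeting.",
      "segments": [
        {"id": 0, "start": 0.0, "end": 3.2,
         "text": "Hello, welcome to the meeting.", "speaker": 0}
      ]
    }
    """
)

for line in speaker_lines(sample):
    print(line)  # [00:00-00:03] Speaker 0: Hello, welcome to the meeting.
```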
Other Endpoints
# List models
curl http://localhost:8000/v1/models
# Health check
curl http://localhost:8000/health
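Because the model can take a while to load, a script that starts the server may want to wait for the health endpoint before sending audio. The sketch below polls /health using the standard library. It assumes the endpoint answers with HTTP 200 once the server is up, which this README does not spell out, and the names `http_status` and `wait_until_healthy` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code of a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_healthy(
    url="http://localhost:8000/health",
    timeout=180.0,
    interval=2.0,
    probe=http_status,
    sleep=time.sleep,
):
    """Poll the URL until it returns 200 (assumed to mean ready) or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server; in normal use, `wait_until_healthy()` with no arguments is enough.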
Using with OpenAI Client Libraries
Point any OpenAI SDK at your local server:
from openai import OpenAI

# api_key is a placeholder; the SDK requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="vibevoice-asr",
        file=f,
        response_format="verbose_json",
    )

print(transcript.text)
Docker
# Build
docker build -t vibevoice-server .
# Run (CPU)
docker run -p 8000:8000 -v vibevoice-cache:/models vibevoice-server
# Run (NVIDIA GPU)
docker run --gpus all -p 8000:8000 -v vibevoice-cache:/models vibevoice-server
Option 2: MCP Server
The MCP (Model Context Protocol) server lets AI tools call transcription directly — no HTTP server needed. The model runs in the same process as the MCP server.
MCP Tools
| Tool | Description |
|---|---|
| transcribe_audio | Transcribe an audio file. Pass an absolute file path and get back the transcript. |
| load_vibevoice_model | Pre-load the model into memory (~60-90s). Optional; the model loads automatically on first transcription. |
| get_vibevoice_status | Check whether the model is loaded, and which device/dtype it's using. |
transcribe_audio parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | string | required | Absolute path to the audio file |
| response_format | string | text | text, json, verbose_json, srt, vtt |
| prompt | string | | Optional context to guide transcription |
| language | string | | Language code (for verbose_json output) |
Claude Code
Add to your project's .mcp.json (or ~/.claude/mcp.json for global access):
{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": []
    }
  }
}
With device override:
{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": ["--device", "mps"]
    }
  }
}
Restart Claude Code after adding the config. The three tools (transcribe_audio, load_vibevoice_model, get_vibevoice_status) will appear automatically.
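If a project already has a .mcp.json with other servers in it, the entry can be merged in rather than pasted by hand. The following is a minimal sketch under that assumption; `add_mcp_server` is a hypothetical helper, not something shipped with this project, and it simply rewrites the file with the new entry added under mcpServers.

```python
import json
from pathlib import Path


def add_mcp_server(config_path, name, command, args=()):
    """Add or replace one entry under "mcpServers", keeping existing entries."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `add_mcp_server(".mcp.json", "vibevoice-asr", "vibevoice-mcp", ["--device", "mps"])` would produce the device-override configuration shown above.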
Cursor
Add to .cursor/mcp.json in your project root:
{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": []
    }
  }
}
OpenCode
Add to your OpenCode MCP configuration (opencode.json or via settings):
{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": []
    }
  }
}
Any MCP-Compatible Tool
The server uses stdio transport — the standard for local MCP servers. Any tool that supports MCP can run it with:
- Command: vibevoice-mcp
- Args: [] (optional: ["--device", "mps"] or ["--device", "cuda"])
- Transport: stdio
The MCP server reads JSON-RPC from stdin and writes responses to stdout. All logs go to stderr.
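To make the wire format concrete, here is a sketch of the single-line JSON-RPC message an MCP client would write to the server's stdin to invoke transcribe_audio. It follows the general MCP `tools/call` request shape; the helper name, the request id, and the file path are illustrative assumptions, and a real client must complete the MCP initialize handshake before sending tool calls.

```python
import json


def tool_call_message(request_id, tool, arguments):
    """Serialize an MCP tools/call request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # Compact separators keep the message on a single line; json.dumps escapes
    # any newlines inside string values, so the only raw newline is the terminator.
    return json.dumps(message, separators=(",", ":")) + "\n"


line = tool_call_message(
    2,
    "transcribe_audio",
    {"file_path": "/tmp/meeting.wav", "response_format": "srt"},  # hypothetical path
)
print(line, end="")
```

In practice an MCP client library handles this framing; the sketch is only meant to show why logs must stay on stderr, since anything else written to stdout would corrupt the message stream.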
MCP CLI Options
vibevoice-mcp [OPTIONS]
  --device      Device: auto, cuda, mps, cpu (default: auto)
  --dtype       Data type: auto, bfloat16, float32 (default: auto)
  --log-level   Log level (default: warning)
Configuration
All settings can be controlled via environment variables (prefixed with VIBEVOICE_), CLI flags, or a .env file. See .env.example for the full list.
| Variable | Default | Description |
|---|---|---|
| VIBEVOICE_DEVICE | auto | auto, cuda, mps, cpu |
| VIBEVOICE_DTYPE | auto | auto, bfloat16, float32 |
| VIBEVOICE_CACHE_DIR | (HuggingFace default) | Where to store downloaded model weights |
| VIBEVOICE_MODEL_ID | microsoft/VibeVoice-ASR-HF | HuggingFace model ID |
| VIBEVOICE_HOST | 0.0.0.0 | API server bind address |
| VIBEVOICE_PORT | 8000 | API server bind port |
| VIBEVOICE_LOG_LEVEL | info | Logging level |
Device auto-detection picks the best available: CUDA > MPS > CPU.
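The priority order can be illustrated with a short sketch. This is not the project's actual implementation: `resolve_device` and `load_settings` are hypothetical names, the availability flags are passed in explicitly so the example runs without PyTorch installed, and only the defaults listed in the table above are used.

```python
import os


def resolve_device(requested, cuda_available, mps_available):
    """Apply the CUDA > MPS > CPU priority when the requested device is "auto"."""
    if requested != "auto":
        return requested  # An explicit choice is passed through unchanged.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def load_settings(env=None):
    """Read a few VIBEVOICE_ variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "device": env.get("VIBEVOICE_DEVICE", "auto"),
        "dtype": env.get("VIBEVOICE_DTYPE", "auto"),
        "host": env.get("VIBEVOICE_HOST", "0.0.0.0"),
        "port": int(env.get("VIBEVOICE_PORT", "8000")),
        "log_level": env.get("VIBEVOICE_LOG_LEVEL", "info"),
    }


settings = load_settings({"VIBEVOICE_PORT": "9000"})
# With PyTorch, the two flags would typically come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(resolve_device(settings["device"], cuda_available=False, mps_available=True))  # mps
```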
Hardware Notes
| Platform | Device | Dtype | Notes |
|---|---|---|---|
| NVIDIA GPU | cuda | bfloat16 | Fastest. Flash Attention 2 enabled automatically. Install with .[cuda]. |
| Apple Silicon | mps | float32 | Works well on M1/M2/M3/M4. |
| CPU | cpu | float32 | Slower but works everywhere. |
The model is ~3 GB. First load takes 60-90 seconds (downloading + loading weights). Subsequent starts are faster when cached.
License
MIT