vibevoice-asr

Local speech-to-text transcription using Microsoft's VibeVoice-ASR model with speaker diarization, enabling audio transcription directly in AI tools like Claude Code, Cursor, and OpenCode.

VibeVoice-ASR Server

Local speech-to-text using Microsoft's VibeVoice-ASR model. Run it as an OpenAI-compatible API server or as an MCP server that plugs directly into Claude Code, OpenCode, Cursor, and other AI tools.

  • Automatic speaker diarization
  • Timestamps on every segment
  • Output as plain text, JSON, SRT, or VTT
  • Runs on CUDA, Apple Silicon (MPS), or CPU
  • Model downloads automatically on first run

Requirements

  • Python 3.10+
  • FFmpeg (used by the model's audio processor)

Install FFmpeg:

# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt-get install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg

Quick Start

# Clone the repo
git clone https://github.com/tjameswilliams/vibevoice-server.git
cd vibevoice-server

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install
pip install -e .

# For NVIDIA GPU acceleration (optional)
pip install -e ".[cuda]"

The first time you run either the API server or the MCP server, the model (~3 GB) is downloaded from Hugging Face and cached locally, so later runs skip the download.


Option 1: OpenAI-Compatible API Server

Start the server:

vibevoice-server

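The server listens on port 8000 by default and is reachable at http://localhost:8000. Because the default bind address is 0.0.0.0, it also accepts connections from other machines on the network; pass --host 127.0.0.1 to restrict it to the local machine. It exposes the same transcription endpoint shape as the OpenAI Audio API, so client libraries and tools that speak that protocol should work without changes.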

CLI Options

vibevoice-server [OPTIONS]

  --host        Bind address (default: 0.0.0.0)
  --port        Bind port (default: 8000)
  --device      Device: auto, cuda, mps, cpu (default: auto)
  --dtype       Data type: auto, bfloat16, float32 (default: auto)
  --log-level   Log level: debug, info, warning, error (default: info)

Transcribe Audio

curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@meeting.wav \
  -F response_format=verbose_json

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | required | Audio file (wav, mp3, flac, m4a, ogg, etc.) |
| model | string | vibevoice-asr | Model identifier (accepted but ignored) |
| response_format | string | json | text, json, verbose_json, srt, vtt |
| prompt | string | | Optional context to guide transcription |
| language | string | | Language code (used in verbose_json output) |
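
The same upload can be made from Python without any third-party package. The sketch below is an illustration under assumptions, not code from this project: the helper names build_multipart and transcribe are hypothetical, and transcribe parses the reply as JSON, so it only suits the json and verbose_json formats.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict[str, str], file_name: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one "file" part as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8000", **fields: str) -> dict:
    """POST an audio file to /v1/audio/transcriptions and parse the JSON reply."""
    audio = Path(path)
    body, content_type = build_multipart(fields, audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example call (needs a running server and a local meeting.wav):
# result = transcribe("meeting.wav", response_format="verbose_json")
# print(result["text"])
```

Extra keyword arguments such as prompt or language are sent as additional form fields, mirroring the -F flags in the curl command.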

Response Formats

json (default):

{"text": "Hello, welcome to the meeting."}

verbose_json — includes timestamps, speaker IDs, and segments:

{
  "task": "transcribe",
  "language": "en",
  "duration": 12.5,
  "text": "Hello, welcome to the meeting.",
  "segments": [
    {"id": 0, "start": 0.0, "end": 3.2, "text": "Hello, welcome to the meeting.", "speaker": 0}
  ]
}
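
As a small illustration of consuming this payload, the sketch below turns the segments list into a readable, speaker-labelled transcript and merges consecutive segments from the same speaker. The function name format_by_speaker and the output layout are assumptions made for this example; only the field names come from the response shown above.

```python
def format_by_speaker(response: dict) -> str:
    """Render verbose_json segments as one line per speaker turn.

    Consecutive segments with the same speaker ID are merged into one turn.
    """
    turns: list[dict] = []
    for seg in response.get("segments", []):
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append({key: seg[key] for key in ("speaker", "start", "end", "text")})
    return "\n".join(
        f"Speaker {t['speaker']} [{t['start']:.1f}s-{t['end']:.1f}s]: {t['text']}"
        for t in turns
    )


sample = {
    "task": "transcribe",
    "language": "en",
    "duration": 12.5,
    "text": "Hello, welcome to the meeting.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 3.2, "text": "Hello, welcome to the meeting.", "speaker": 0}
    ],
}
print(format_by_speaker(sample))
# Speaker 0 [0.0s-3.2s]: Hello, welcome to the meeting.
```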

srt and vtt — subtitle formats with speaker labels, ready to use with video players.

text — plain transcript string, no JSON wrapper.
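
If you already hold a verbose_json response and want subtitles without a second request, the conversion is simple to do client-side. This is a sketch of the general SRT layout, not the server's own formatter: the function names and the "[Speaker N]" label style are assumptions, and the server's srt output may label speakers differently.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT cues (index, time range, text) from verbose_json segments."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[Speaker {seg['speaker']}] {seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


print(segments_to_srt([
    {"id": 0, "start": 0.0, "end": 3.2, "text": "Hello, welcome to the meeting.", "speaker": 0}
]))
# 1
# 00:00:00,000 --> 00:00:03,200
# [Speaker 0] Hello, welcome to the meeting.
```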

Other Endpoints

# List models
curl http://localhost:8000/v1/models

# Health check
curl http://localhost:8000/health
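
Scripts that start the server and then send audio may want to wait until it responds. The sketch below polls the health endpoint and treats any HTTP 200 as ready. That is an assumption: the body of the /health response is not documented here, and the helper names, attempt count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 60,
    delay: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example: block for up to about two minutes before sending requests.
# if not wait_until_healthy():
#     raise SystemExit("vibevoice-server did not become healthy in time")
```

The probe parameter exists only so the loop can be exercised without a network; in normal use the default is enough.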

Using with OpenAI Client Libraries

Point any OpenAI SDK at your local server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="vibevoice-asr",
        file=f,
        response_format="verbose_json",
    )

print(transcript.text)

Docker

# Build
docker build -t vibevoice-server .

# Run (CPU)
docker run -p 8000:8000 -v vibevoice-cache:/models vibevoice-server

# Run (NVIDIA GPU)
docker run --gpus all -p 8000:8000 -v vibevoice-cache:/models vibevoice-server

Option 2: MCP Server

The MCP (Model Context Protocol) server lets AI tools call transcription directly — no HTTP server needed. The model runs in the same process as the MCP server.

MCP Tools

| Tool | Description |
| --- | --- |
| transcribe_audio | Transcribe an audio file. Pass an absolute file path and get back the transcript. |
| load_vibevoice_model | Pre-load the model into memory (~60-90s). Optional: the model loads automatically on first transcription. |
| get_vibevoice_status | Check whether the model is loaded, and which device/dtype it's using. |

transcribe_audio parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file_path | string | required | Absolute path to the audio file |
| response_format | string | text | text, json, verbose_json, srt, vtt |
| prompt | string | | Optional context to guide transcription |
| language | string | | Language code (for verbose_json output) |

Claude Code

Add to your project's .mcp.json (or ~/.claude/mcp.json for global access):

{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": []
    }
  }
}

With device override:

{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": ["--device", "mps"]
    }
  }
}

Restart Claude Code after adding the config. The three tools (transcribe_audio, load_vibevoice_model, get_vibevoice_status) will appear automatically.

Cursor

Add to .cursor/mcp.json in your project root:

{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": []
    }
  }
}

OpenCode

Add to your OpenCode MCP configuration (opencode.json or via settings):

{
  "mcpServers": {
    "vibevoice-asr": {
      "command": "vibevoice-mcp",
      "args": []
    }
  }
}

Any MCP-Compatible Tool

The server uses stdio transport — the standard for local MCP servers. Any tool that supports MCP can run it with:

  • Command: vibevoice-mcp
  • Args: [] (optional: ["--device", "mps"] or ["--device", "cuda"])
  • Transport: stdio

The MCP server reads JSON-RPC from stdin and writes responses to stdout. All logs go to stderr.
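
For orientation, this is roughly what a client writes to the server's stdin when it invokes a tool. The sketch builds a standard MCP tools/call request as a single line of JSON-RPC 2.0; the helper name and the file path are made up for the example. A real client must first complete the MCP initialize handshake, which MCP-aware tools handle for you.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as one line of JSON-RPC 2.0."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, so the message stays on one line,
    # as the stdio transport expects.
    return json.dumps(message)


line = tools_call_request(
    1,
    "transcribe_audio",
    {"file_path": "/absolute/path/to/meeting.wav", "response_format": "srt"},  # hypothetical path
)
print(line)
```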

MCP CLI Options

vibevoice-mcp [OPTIONS]

  --device      Device: auto, cuda, mps, cpu (default: auto)
  --dtype       Data type: auto, bfloat16, float32 (default: auto)
  --log-level   Log level (default: warning)

Configuration

All settings can be controlled via environment variables (prefixed with VIBEVOICE_), CLI flags, or a .env file. See .env.example for the full list.

| Variable | Default | Description |
| --- | --- | --- |
| VIBEVOICE_DEVICE | auto | auto, cuda, mps, cpu |
| VIBEVOICE_DTYPE | auto | auto, bfloat16, float32 |
| VIBEVOICE_CACHE_DIR | (HuggingFace default) | Where to store downloaded model weights |
| VIBEVOICE_MODEL_ID | microsoft/VibeVoice-ASR-HF | HuggingFace model ID |
| VIBEVOICE_HOST | 0.0.0.0 | API server bind address |
| VIBEVOICE_PORT | 8000 | API server bind port |
| VIBEVOICE_LOG_LEVEL | info | Logging level |
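
To see which values a given environment would produce, the table can be mirrored in a few lines of Python. This is only a sketch of the documented defaults, not the server's actual settings loader: it ignores CLI flags and .env files, and the precedence between those sources is not covered here. The names DEFAULTS and read_settings are hypothetical.

```python
import os
from collections.abc import Mapping

# Defaults copied from the table above; VIBEVOICE_CACHE_DIR has no fixed default.
DEFAULTS = {
    "VIBEVOICE_DEVICE": "auto",
    "VIBEVOICE_DTYPE": "auto",
    "VIBEVOICE_MODEL_ID": "microsoft/VibeVoice-ASR-HF",
    "VIBEVOICE_HOST": "0.0.0.0",
    "VIBEVOICE_PORT": "8000",
    "VIBEVOICE_LOG_LEVEL": "info",
}


def read_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the documented settings, overridden by any VIBEVOICE_* variables in `env`."""
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    cache_dir = env.get("VIBEVOICE_CACHE_DIR")
    if cache_dir:
        settings["VIBEVOICE_CACHE_DIR"] = cache_dir
    return settings


print(read_settings({"VIBEVOICE_PORT": "9000"})["VIBEVOICE_PORT"])
# 9000
```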

Device auto-detection picks the best available: CUDA > MPS > CPU.
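
The priority order can be written as a small function. This is a sketch of the stated CUDA > MPS > CPU rule, not the project's implementation; pick_device is a hypothetical name, and the availability flags are passed in so the logic can be read and tested without PyTorch installed.

```python
def pick_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve a device setting, applying CUDA > MPS > CPU when it is "auto"."""
    if requested != "auto":
        # An explicit choice (cuda, mps, cpu) is taken as given.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags would typically come from:
#   torch.cuda.is_available()
#   torch.backends.mps.is_available()
print(pick_device("auto", cuda_available=False, mps_available=True))
# mps
```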


Hardware Notes

| Platform | Device | Dtype | Notes |
| --- | --- | --- | --- |
| NVIDIA GPU | cuda | bfloat16 | Fastest. Flash Attention 2 enabled automatically. Install with .[cuda]. |
| Apple Silicon | mps | float32 | Works well on M1/M2/M3/M4. |
| CPU | cpu | float32 | Slower but works everywhere. |

The model is about 3 GB. The first load takes roughly 60-90 seconds, which covers downloading the weights and loading them into memory. Later starts reuse the cached weights and are faster.


License

MIT
