Audio Transcription MCP
MCP (Model Context Protocol) server for audio transcription with speaker diarization. Transcribes MP3/WAV files using Faster-Whisper and pyannote.audio, outputting markdown with speaker labels, timestamps, summaries, and action items.
Features
- Speaker Diarization - Identifies and labels different speakers (Speaker 1, Speaker 2, etc.)
- Markdown Output - Clean, formatted transcripts with timestamps
- Docker Ready - CPU and GPU containers for easy deployment
- MCP Protocol - Integrates with GitHub Copilot CLI and other MCP clients
- Offline Capable - Models cached locally after first run
- GPU Acceleration - NVIDIA CUDA support for faster processing
Requirements
Prerequisites
- Python 3.11+ (for local development)
- Docker (recommended for deployment)
- Hugging Face Account (free, for model access)
- NVIDIA GPU + CUDA 12.3 (optional, for GPU acceleration)
Hugging Face Setup (Required)
- Create a free account at huggingface.co
- Accept the terms of the gated pyannote.audio models on their Hugging Face model pages
- Generate a token at huggingface.co/settings/tokens
Quick Start
Option 1: Docker (Recommended)
# Clone the repository
git clone https://github.com/ebmarquez/audio-transcription-mcp.git
cd audio-transcription-mcp
# Create .env file with your HF token
echo "HF_TOKEN=hf_your_token_here" > .env
# Build and run with Docker Compose
cd docker
docker compose up -d
# Container is now running at http://localhost:8080/mcp
Option 2: Docker Run (One-Shot)
# CPU version
docker run --rm \
  -e HF_TOKEN="hf_your_token" \
  -v "$(pwd)/input:/input:ro" \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/models:/root/.cache" \
  -p 8080:8080 \
  audio-transcription-mcp:cpu
# GPU version (NVIDIA)
docker run --rm --gpus all \
  -e HF_TOKEN="hf_your_token" \
  -v "$(pwd)/input:/input:ro" \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/models:/root/.cache" \
  -p 8080:8080 \
  audio-transcription-mcp:gpu
Option 3: Local Development
# Clone and install
git clone https://github.com/ebmarquez/audio-transcription-mcp.git
cd audio-transcription-mcp
pip install -e .
# Set up environment
cp .env.example .env
# Edit .env and add your HF_TOKEN
# Run MCP server
python -m audio_transcription_mcp
MCP Client Configuration
GitHub Copilot CLI (Docker Mode)
Add to your mcp.json:
{
  "mcpServers": {
    "audio-transcription": {
      "url": "http://localhost:8080/mcp",
      "transport": "streamable-http"
    }
  }
}
GitHub Copilot CLI (Local Mode)
{
  "mcpServers": {
    "audio-transcription": {
      "command": "python",
      "args": ["-m", "audio_transcription_mcp"],
      "env": {
        "HF_TOKEN": "${HF_TOKEN}",
        "OUTPUT_DIR": "./transcriptions"
      }
    }
  }
}
MCP Tools
transcribe_audio
Transcribe a single audio file with speaker diarization.
transcribe_audio(
    file_path="/input/meeting.mp3",
    output_dir="/output",
    model_size="large-v3",
    include_timestamps=True,
    generate_summary=True
)
transcribe_directory
Batch transcribe all audio files in a directory.
transcribe_directory(
    directory_path="/input",
    output_dir="/output",
    recursive=False
)
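As a rough illustration of what the `recursive` flag implies, the sketch below collects MP3/WAV files from a directory using only the standard library. The helper name `find_audio_files` and the extension set are assumptions made for this example; the server's actual file discovery may differ.

```python
from pathlib import Path

# Assumption: only the two formats named in this README are considered.
AUDIO_EXTENSIONS = {".mp3", ".wav"}


def find_audio_files(directory_path: str, recursive: bool = False) -> list[Path]:
    """Return MP3/WAV files under directory_path, sorted for stable ordering."""
    root = Path(directory_path)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {directory_path}")
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

With `recursive=False`, files in subdirectories are skipped, which matches the default shown in the call above.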
get_transcription_status
Check if an audio file has been transcribed.
get_transcription_status(file_path="/input/meeting.mp3")
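On the wire, MCP clients invoke tools like these with JSON-RPC 2.0 `tools/call` requests. The sketch below only builds the request body for `transcribe_audio`; it does not send it. The helper `build_tool_call` is hypothetical, and a real client (for example an MCP SDK) would also perform the initialization handshake before calling any tool.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC ids only need to be unique within a session


def build_tool_call(tool_name: str, **arguments) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request body (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    "transcribe_audio",
    file_path="/input/meeting.mp3",
    output_dir="/output",
    model_size="large-v3",
    include_timestamps=True,
    generate_summary=True,
)
print(json.dumps(request, indent=2))
```

The same helper works for the other two tools by changing the tool name and arguments.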
Output Format
Transcriptions are saved as markdown files:
# Audio Transcription: meeting-recording.mp3
## Metadata
- **Source File**: meeting-recording.mp3
- **Duration**: 45:32
- **Speakers Detected**: 3
- **Transcription Date**: 2026-01-29
- **Model**: faster-whisper large-v3
---
## Transcript
### [00:00:00] **Speaker 1**
Good morning everyone. Let's get started with our weekly sync.
### [00:00:05] **Speaker 2**
Thanks for organizing this. I have a few updates on the project.
...
---
## Summary
[AI-generated summary placeholder]
## Key Points
- Point 1 extracted from conversation
- Point 2 extracted from conversation
## Action Items
- [ ] Action item 1 - Assigned to: Speaker 1
- [ ] Action item 2 - Assigned to: Speaker 2
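To make the transcript layout concrete, here is a minimal sketch that renders speaker-labeled segments into the `## Transcript` section shown above. The `Segment` class and both function names are assumptions for illustration and are not taken from the project's `markdown_generator.py`.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: float  # offset from the start of the recording
    speaker: str          # e.g. "Speaker 1"
    text: str


def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: list[Segment]) -> str:
    """Render segments as a '## Transcript' section with one heading per turn."""
    lines = ["## Transcript"]
    for seg in segments:
        lines.append(f"### [{format_timestamp(seg.start_seconds)}] **{seg.speaker}**")
        lines.append(seg.text)
    return "\n".join(lines)
```

Calling `render_transcript` with two segments starting at 0 and 5 seconds produces headings in the same `### [00:00:05] **Speaker 2**` form as the sample.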
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| HF_TOKEN | Hugging Face token (required) | - |
| WHISPER_MODEL | Model size: tiny/base/small/medium/large-v3 | large-v3 |
| LANGUAGE | Transcription language (ISO 639-1) | en |
| MAX_FILE_SIZE_GB | Maximum file size in GB | 1 |
| INPUT_DIR | Input directory for audio files | ./input |
| OUTPUT_DIR | Output directory for transcriptions | ./output |
| MCP_TRANSPORT | Transport mode: stdio/streamable-http | streamable-http |
| MCP_PORT | HTTP port (for streamable-http) | 8080 |
| CUDA_VISIBLE_DEVICES | GPU device ID (-1 for CPU) | 0 |
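The sketch below shows one way these variables and defaults could be read in Python. The `Settings` class and `load_settings` function are hypothetical names for this example and are not the contents of the project's `config.py`; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    hf_token: str
    whisper_model: str
    language: str
    max_file_size_gb: float
    input_dir: str
    output_dir: str
    mcp_transport: str
    mcp_port: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying table defaults."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    transport = env.get("MCP_TRANSPORT", "streamable-http")
    if transport not in {"stdio", "streamable-http"}:
        raise ValueError(f"unsupported MCP_TRANSPORT: {transport}")
    return Settings(
        hf_token=token,
        whisper_model=env.get("WHISPER_MODEL", "large-v3"),
        language=env.get("LANGUAGE", "en"),
        max_file_size_gb=float(env.get("MAX_FILE_SIZE_GB", "1")),
        input_dir=env.get("INPUT_DIR", "./input"),
        output_dir=env.get("OUTPUT_DIR", "./output"),
        mcp_transport=transport,
        mcp_port=int(env.get("MCP_PORT", "8080")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.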
Model Size Comparison
| Model | Accuracy | Speed | Memory |
|---|---|---|---|
| tiny | ★ | Fastest | ~1GB |
| base | ★★ | Fast | ~1GB |
| small | ★★★ | Moderate | ~2GB |
| medium | ★★★★ | Slow | ~5GB |
| large-v3 | ★★★★★ | Slowest | ~10GB |
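As a small worked example of reading this table, the sketch below picks the most accurate model whose approximate memory figure fits a given budget. The function name is hypothetical and the figures are the rough values from the table, so treat the result as a starting point rather than a guarantee.

```python
from typing import Optional

# (model name, approximate memory in GB), ordered from least to most accurate.
MODEL_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def largest_model_that_fits(available_gb: float) -> Optional[str]:
    """Return the most accurate model whose memory estimate fits, or None."""
    best = None
    for name, needed_gb in MODEL_MEMORY_GB:
        if needed_gb <= available_gb:
            best = name  # later entries are more accurate, so keep the last fit
    return best
```

The chosen name could then be supplied through `WHISPER_MODEL` or the `model_size` argument.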
Project Structure
audio-transcription-mcp/
├── docker/
│   ├── Dockerfile.cpu              # CPU container
│   ├── Dockerfile.gpu              # GPU container (NVIDIA)
│   ├── docker-compose.yml          # Development compose
│   ├── docker-compose.prod.yml     # Production compose
│   └── entrypoint.sh               # Container startup
├── src/
│   └── audio_transcription_mcp/
│       ├── __init__.py
│       ├── __main__.py             # Entry point
│       ├── server.py               # MCP server
│       ├── config.py               # Configuration
│       ├── audio_processor.py      # File handling
│       ├── transcriber.py          # Faster-Whisper
│       ├── diarizer.py             # pyannote.audio
│       ├── segment_merger.py       # Align segments
│       └── markdown_generator.py
├── tests/
├── input/                          # Audio files (mount point)
├── output/                         # Transcriptions (mount point)
├── models/                         # Model cache (mount point)
├── .env.example
├── pyproject.toml
└── requirements.txt
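The tree lists `segment_merger.py` as the step that aligns transcription segments with diarization output. One common way to do this is to give each text segment the speaker whose turn overlaps it the most in time. The sketch below illustrates that idea only; it is an assumption about the general technique, not the project's implementation, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    start: float  # seconds
    end: float
    text: str


@dataclass
class SpeakerTurn:
    start: float  # seconds
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Pair each text segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg))
    return labeled
```

Segments that overlap no speaker turn keep the `unknown` label instead of being dropped, so no transcribed text is lost.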
Docker Volumes
| Mount Point | Purpose | Mode |
|---|---|---|
| /input | Audio files to transcribe | Read-only |
| /output | Transcription results | Read-write |
| /root/.cache | Model cache (persistent) | Read-write |
Known Limitations
- Speaker Diarization: Works best with 2-6 distinct speakers
- Audio Quality: May struggle with background noise, overlapping speech, or phone/video call audio
- Large Files: Recordings longer than 30 minutes may take significant processing time
- First Run: The initial model download (~3GB) requires an internet connection
Security
- HF_TOKEN: Store securely, never commit to repository
- Input Validation: Strict file type and size validation
- Path Traversal: All file paths are sanitized
- Container Isolation: Runs with minimal privileges
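The input validation and path sanitization points above can be sketched with the standard library as follows. This is an illustration under stated assumptions (the function name, the extension set, and the 1 GB limit taken from the `MAX_FILE_SIZE_GB` default), not the checks the server actually runs.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav"}  # assumption: formats named in this README
MAX_FILE_SIZE_BYTES = 1 * 1024**3      # assumption: mirrors the MAX_FILE_SIZE_GB default


def validate_audio_path(
    file_path: str,
    input_dir: str,
    max_bytes: int = MAX_FILE_SIZE_BYTES,
) -> Path:
    """Resolve file_path inside input_dir and reject unsafe or unsupported files."""
    root = Path(input_dir).resolve()
    # resolve() collapses ".." and symlinks, so the containment check below
    # sees the real location of the file.
    candidate = (root / file_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError("path escapes the input directory")
    if candidate.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {candidate.suffix}")
    if not candidate.is_file():
        raise FileNotFoundError(str(candidate))
    if candidate.stat().st_size > max_bytes:
        raise ValueError("file exceeds the maximum allowed size")
    return candidate
```

Resolving before checking containment is the key design choice: a request such as `../secret.mp3` is rejected even though its extension is allowed.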
License
MIT License - see LICENSE for details.
Acknowledgments
- Faster-Whisper - Fast Whisper implementation
- pyannote.audio - Speaker diarization
- Model Context Protocol - MCP specification