audio-transcription-mcp

Audio Transcription MCP

License: MIT | Python 3.11+ | Docker

MCP (Model Context Protocol) server for audio transcription with speaker diarization. Transcribes MP3/WAV files using Faster-Whisper and pyannote.audio, outputting markdown with speaker labels, timestamps, summaries, and action items.

✨ Features

  • 🎤 Speaker Diarization - Identifies and labels different speakers (Speaker 1, Speaker 2, etc.)
  • 📝 Markdown Output - Clean, formatted transcripts with timestamps
  • 🐳 Docker Ready - CPU and GPU containers for easy deployment
  • 🚀 MCP Protocol - Integrates with GitHub Copilot CLI and other MCP clients
  • 🔒 Offline Capable - Models cached locally after first run
  • ⚡ GPU Acceleration - NVIDIA CUDA support for faster processing

📋 Requirements

Prerequisites

  • Python 3.11+ (for local development)
  • Docker (recommended for deployment)
  • Hugging Face Account (free, for model access)
  • NVIDIA GPU + CUDA 12.3 (optional, for GPU acceleration)

Hugging Face Setup (Required)

  1. Create a free account at huggingface.co
  2. Accept the model terms on the Hugging Face pages of the gated pyannote.audio models used for diarization
  3. Generate a token at huggingface.co/settings/tokens

🚀 Quick Start

Option 1: Docker (Recommended)

# Clone the repository
git clone https://github.com/ebmarquez/audio-transcription-mcp.git
cd audio-transcription-mcp

# Create .env file with your HF token
echo "HF_TOKEN=hf_your_token_here" > .env

# Build and run with Docker Compose
cd docker
docker compose up -d

# Container is now running at http://localhost:8080/mcp

Option 2: Docker Run (One-Shot)

# CPU version
docker run --rm \
  -e HF_TOKEN="hf_your_token" \
  -v "$(pwd)/input:/input:ro" \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/models:/root/.cache" \
  -p 8080:8080 \
  audio-transcription-mcp:cpu

# GPU version (NVIDIA)
docker run --rm --gpus all \
  -e HF_TOKEN="hf_your_token" \
  -v "$(pwd)/input:/input:ro" \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/models:/root/.cache" \
  -p 8080:8080 \
  audio-transcription-mcp:gpu

Option 3: Local Development

# Clone and install
git clone https://github.com/ebmarquez/audio-transcription-mcp.git
cd audio-transcription-mcp
pip install -e .

# Set up environment
cp .env.example .env
# Edit .env and add your HF_TOKEN

# Run MCP server
python -m audio_transcription_mcp

🔧 MCP Client Configuration

GitHub Copilot CLI (Docker Mode)

Add to your mcp.json:

{
  "mcpServers": {
    "audio-transcription": {
      "url": "http://localhost:8080/mcp",
      "transport": "streamable-http"
    }
  }
}
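With the `streamable-http` transport, an MCP client opens a session by POSTing a JSON-RPC `initialize` request to the endpoint. The sketch below only builds that request so its shape is visible. The protocol version string and the client name are assumptions, and a real client would normally use an MCP SDK rather than hand-built payloads.

```python
import json

# Endpoint from the Docker quick start above.
MCP_URL = "http://localhost:8080/mcp"

# Streamable HTTP clients accept either a JSON body or an SSE stream in response.
HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def build_initialize_request(request_id=1, protocol_version="2025-03-26"):
    """Build the JSON-RPC 2.0 `initialize` request that starts an MCP session."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            # Assumption: a protocol revision that includes the streamable HTTP transport.
            "protocolVersion": protocol_version,
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "example-client", "version": "0.1.0"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_initialize_request(), indent=2))
```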

GitHub Copilot CLI (Local Mode)

{
  "mcpServers": {
    "audio-transcription": {
      "command": "python",
      "args": ["-m", "audio_transcription_mcp"],
      "env": {
        "HF_TOKEN": "${HF_TOKEN}",
        "OUTPUT_DIR": "./transcriptions"
      }
    }
  }
}

🛠️ MCP Tools

transcribe_audio

Transcribe a single audio file with speaker diarization.

transcribe_audio(
    file_path="/input/meeting.mp3",
    output_dir="/output",
    model_size="large-v3",
    include_timestamps=True,
    generate_summary=True
)
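On the wire, an MCP client sends a tool invocation like the one above as a JSON-RPC `tools/call` request. A minimal sketch of that payload follows. `build_tool_call` is a hypothetical helper, and the argument names are taken from the example call in this README.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "transcribe_audio",
    {
        "file_path": "/input/meeting.mp3",
        "output_dir": "/output",
        "model_size": "large-v3",
        "include_timestamps": True,
        "generate_summary": True,
    },
)
print(json.dumps(request, indent=2))
```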

transcribe_directory

Batch transcribe all audio files in a directory.

transcribe_directory(
    directory_path="/input",
    output_dir="/output",
    recursive=False
)
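How the server enumerates files is not documented here. One plausible reading of the `recursive` flag, sketched under that assumption, is a directory walk filtered to the MP3/WAV extensions this README names. `find_audio_files` is a hypothetical helper, not the project's API.

```python
from pathlib import Path

# The two formats this README mentions; the real server may accept more.
AUDIO_EXTENSIONS = {".mp3", ".wav"}


def find_audio_files(directory_path, recursive=False):
    """Return audio files under directory_path, sorted for a stable batch order."""
    root = Path(directory_path)
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```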

get_transcription_status

Check if an audio file has been transcribed.

get_transcription_status(file_path="/input/meeting.mp3")
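A status check of this kind can be as simple as looking for the expected output file. The sketch below assumes the transcript is written to the output directory under the audio file's stem with a `.md` extension. Both that naming rule and the returned fields are assumptions, not documented behavior.

```python
from pathlib import Path


def transcription_status(file_path, output_dir="./output"):
    """Report whether a markdown transcript exists for the given audio file."""
    # Assumption: meeting.mp3 is transcribed to <output_dir>/meeting.md.
    transcript = Path(output_dir) / (Path(file_path).stem + ".md")
    return {
        "file": str(file_path),
        "transcribed": transcript.is_file(),
        "transcript_path": str(transcript),
    }
```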

📄 Output Format

Transcriptions are saved as markdown files:

# Audio Transcription: meeting-recording.mp3

## Metadata
- **Source File**: meeting-recording.mp3
- **Duration**: 45:32
- **Speakers Detected**: 3
- **Transcription Date**: 2026-01-29
- **Model**: faster-whisper large-v3

---

## Transcript

### [00:00:00] **Speaker 1**
Good morning everyone. Let's get started with our weekly sync.

### [00:00:05] **Speaker 2**
Thanks for organizing this. I have a few updates on the project.

...

---

## Summary
[AI-generated summary placeholder]

## Key Points
- Point 1 extracted from conversation
- Point 2 extracted from conversation

## Action Items
- [ ] Action item 1 - Assigned to: Speaker 1
- [ ] Action item 2 - Assigned to: Speaker 2
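The transcript section of this format is straightforward to produce from labelled segments. The following is a minimal sketch of such a renderer. The function names and the `(start_seconds, speaker, text)` tuple shape are assumptions for illustration and may not match `markdown_generator.py`.

```python
def format_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS, truncating fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments):
    """Render (start_seconds, speaker, text) tuples as the Transcript section."""
    blocks = [
        f"### [{format_timestamp(start)}] **{speaker}**\n{text}"
        for start, speaker, text in segments
    ]
    return "## Transcript\n\n" + "\n\n".join(blocks) + "\n"


print(render_transcript([
    (0.0, "Speaker 1", "Good morning everyone."),
    (5.4, "Speaker 2", "Thanks for organizing this."),
]))
```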

⚙️ Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| HF_TOKEN | Hugging Face token (required) | - |
| WHISPER_MODEL | Model size: tiny/base/small/medium/large-v3 | large-v3 |
| LANGUAGE | Transcription language (ISO 639-1) | en |
| MAX_FILE_SIZE_GB | Maximum file size in GB | 1 |
| INPUT_DIR | Input directory for audio files | ./input |
| OUTPUT_DIR | Output directory for transcriptions | ./output |
| MCP_TRANSPORT | Transport mode: stdio/streamable-http | streamable-http |
| MCP_PORT | HTTP port (for streamable-http) | 8080 |
| CUDA_VISIBLE_DEVICES | GPU device ID (-1 for CPU) | 0 |
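These variables and defaults map naturally onto a small settings object. The sketch below shows one way to load and validate them. `Settings` and `load_settings` are hypothetical names and may differ from what `config.py` actually does.

```python
import os
from dataclasses import dataclass

VALID_MODELS = {"tiny", "base", "small", "medium", "large-v3"}
VALID_TRANSPORTS = {"stdio", "streamable-http"}


@dataclass(frozen=True)
class Settings:
    hf_token: str
    whisper_model: str
    language: str
    max_file_size_gb: float
    input_dir: str
    output_dir: str
    mcp_transport: str
    mcp_port: int


def load_settings(env=None):
    """Read settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise ValueError("HF_TOKEN is required")
    model = env.get("WHISPER_MODEL", "large-v3")
    if model not in VALID_MODELS:
        raise ValueError(f"unsupported WHISPER_MODEL: {model}")
    transport = env.get("MCP_TRANSPORT", "streamable-http")
    if transport not in VALID_TRANSPORTS:
        raise ValueError(f"unsupported MCP_TRANSPORT: {transport}")
    # CUDA_VISIBLE_DEVICES is read by the CUDA runtime itself, so it is not parsed here.
    return Settings(
        hf_token=token,
        whisper_model=model,
        language=env.get("LANGUAGE", "en"),
        max_file_size_gb=float(env.get("MAX_FILE_SIZE_GB", "1")),
        input_dir=env.get("INPUT_DIR", "./input"),
        output_dir=env.get("OUTPUT_DIR", "./output"),
        mcp_transport=transport,
        mcp_port=int(env.get("MCP_PORT", "8080")),
    )
```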

Model Size Comparison

| Model | Accuracy | Speed | Memory |
|-------|----------|-------|--------|
| tiny | ⭐ | Fastest | ~1GB |
| base | ⭐⭐ | Fast | ~1GB |
| small | ⭐⭐⭐ | Moderate | ~2GB |
| medium | ⭐⭐⭐⭐ | Slow | ~5GB |
| large-v3 | ⭐⭐⭐⭐⭐ | Slowest | ~10GB |
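One way to use this comparison is to pick the most accurate model that fits the memory you have. The sketch below encodes the approximate figures from the table above. `pick_model` is a hypothetical helper, and actual memory use will vary with compute type and audio length.

```python
# (model name, approximate memory in GB), ordered from least to most accurate.
MODEL_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_gb):
    """Return the most accurate model whose approximate footprint fits in available_gb."""
    fitting = [name for name, needed in MODEL_MEMORY_GB if needed <= available_gb]
    if not fitting:
        raise ValueError(f"no model fits in {available_gb} GB")
    return fitting[-1]
```

The result can then be passed as `WHISPER_MODEL` or as the `model_size` argument of `transcribe_audio`.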

📁 Project Structure

audio-transcription-mcp/
├── docker/
│   ├── Dockerfile.cpu          # CPU container
│   ├── Dockerfile.gpu          # GPU container (NVIDIA)
│   ├── docker-compose.yml      # Development compose
│   ├── docker-compose.prod.yml # Production compose
│   └── entrypoint.sh           # Container startup
├── src/
│   └── audio_transcription_mcp/
│       ├── __init__.py
│       ├── __main__.py         # Entry point
│       ├── server.py           # MCP server
│       ├── config.py           # Configuration
│       ├── audio_processor.py  # File handling
│       ├── transcriber.py      # Faster-Whisper
│       ├── diarizer.py         # pyannote.audio
│       ├── segment_merger.py   # Align segments
│       └── markdown_generator.py
├── tests/
├── input/                      # Audio files (mount point)
├── output/                     # Transcriptions (mount point)
├── models/                     # Model cache (mount point)
├── .env.example
├── pyproject.toml
└── requirements.txt
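The "Align segments" step has to combine two independent timelines: transcribed text segments and diarized speaker turns. A common approach, sketched here as an assumption rather than as the contents of `segment_merger.py`, is to give each text segment the speaker whose turn overlaps it the longest.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Label each (start, end, text) segment with the best-overlapping speaker turn.

    turns is a list of (start, end, speaker) tuples from diarization.
    Segments that overlap no turn keep the `unknown` label.
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((seg_start, seg_end, best_speaker, text))
    return labelled
```

This is quadratic in the worst case, which is usually acceptable for a single recording. Sorted inputs would allow a linear sweep if that ever matters.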

🐳 Docker Volumes

| Mount Point | Purpose | Mode |
|-------------|---------|------|
| /input | Audio files to transcribe | Read-only |
| /output | Transcription results | Read-write |
| /root/.cache | Model cache (persistent) | Read-write |

⚠️ Known Limitations

  • Speaker Diarization: Works best with 2-6 distinct speakers
  • Audio Quality: May struggle with background noise, overlapping speech, or phone/video call audio
  • Large Files: Recordings longer than 30 minutes may take a long time to process
  • First Run: Initial model download requires internet connection (~3GB)

🔒 Security

  • HF_TOKEN: Store securely, never commit to repository
  • Input Validation: Strict file type and size validation
  • Path Traversal: All file paths are sanitized
  • Container Isolation: Runs with minimal privileges
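The input validation and path traversal points can be illustrated with a short check. This is a sketch of the general technique under stated assumptions, not the project's code: `validate_audio_path` is a hypothetical helper, the extension list comes from the formats named in this README, and the size limit mirrors the `MAX_FILE_SIZE_GB` default.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav"}


def validate_audio_path(file_path, input_dir, max_size_gb=1.0):
    """Resolve file_path inside input_dir and reject anything unsafe or unsupported."""
    root = Path(input_dir).resolve()
    # resolve() collapses ".." and follows symlinks before the containment check.
    candidate = (root / file_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError("path escapes the input directory")
    if candidate.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {candidate.suffix}")
    if not candidate.is_file():
        raise FileNotFoundError(str(candidate))
    if candidate.stat().st_size > max_size_gb * 1024**3:
        raise ValueError("file exceeds the maximum allowed size")
    return candidate
```

Because an absolute `file_path` replaces the root when joined, inputs such as `/etc/passwd` also fail the containment check.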

📜 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • Faster-Whisper - Fast Whisper implementation
  • pyannote.audio - Speaker diarization
  • Model Context Protocol - MCP specification
