MCP Server Whisper

An MCP Server for audio transcription using OpenAI

arcaputo3

<div align="center">

A Model Context Protocol (MCP) server for advanced audio transcription and processing using OpenAI's Whisper and GPT-4o models.

License: MIT | Python 3.10+ | CI Status | Built with uv

</div>

Overview

MCP Server Whisper provides a standardized way to process audio files through OpenAI's latest transcription and speech services. By implementing the Model Context Protocol, it enables AI assistants like Claude to seamlessly interact with audio processing capabilities.

Key features:

  • 🔍 Advanced file searching with regex patterns, file metadata filtering, and sorting capabilities
  • 🔄 Parallel batch processing for multiple audio files
  • 🔄 Format conversion between supported audio types
  • 📦 Automatic compression for oversized files
  • 🎯 Multi-model transcription with whisper-1, gpt-4o-transcribe, and gpt-4o-mini-transcribe
  • 🗣️ Interactive audio chat with GPT-4o audio models
  • ✏️ Enhanced transcription with specialized prompts and timestamp support
  • 🎙️ Text-to-speech generation with customizable voices, instructions, and speed
  • 📊 Comprehensive metadata including duration, file size, and format support
  • 🚀 High-performance caching for repeated operations

Installation

# Clone the repository
git clone https://github.com/arcaputo3/mcp-server-whisper.git
cd mcp-server-whisper

# Install dependencies with uv
uv sync

# Set up pre-commit hooks
uv run pre-commit install

Environment Setup

Create a .env file with the following variables:

OPENAI_API_KEY=your_openai_api_key
AUDIO_FILES_PATH=/path/to/your/audio/files
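
As a rough illustration, the two variables above might be read and validated at startup as follows. This is a minimal sketch: `Settings` and `load_settings` are hypothetical names, not the server's actual code, and it assumes the variables are already present in the process environment (for example via `--env-file`).

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two documented variables."""

    openai_api_key: str
    audio_files_path: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the required variables and fail early with a clear message."""
    required = ("OPENAI_API_KEY", "AUDIO_FILES_PATH")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        audio_files_path=Path(env["AUDIO_FILES_PATH"]),
    )
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.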

Usage

Starting the Server

To run the MCP server in development mode:

mcp dev src/mcp_server_whisper/server.py

To install the server for use with Claude Desktop or other MCP clients:

mcp install src/mcp_server_whisper/server.py [--env-file .env]

Exposed MCP Tools

Audio File Management

  • list_audio_files - Lists audio files with comprehensive filtering and sorting options:
    • Filter by regex pattern matching on filenames
    • Filter by file size, duration, modification time, or format
    • Sort by name, size, duration, modification time, or format
    • All operations support parallelized batch processing
  • get_latest_audio - Gets the most recently modified audio file with model support info
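
The filtering and sorting behaviour of list_audio_files can be pictured with the sketch below. It is not the server's implementation: the `AudioFile` record and `filter_and_sort` function are hypothetical, and only the parameter names (`pattern`, `min_duration_seconds`, `min_modified_time`, `format`, `sort_by`) are taken from the examples later in this README.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioFile:
    """Hypothetical metadata record; field names are illustrative."""

    name: str
    size_bytes: int
    duration_seconds: float
    modified_time: float  # Unix timestamp in seconds
    format: str


def filter_and_sort(
    files: list[AudioFile],
    pattern: Optional[str] = None,
    min_duration_seconds: Optional[float] = None,
    min_modified_time: Optional[float] = None,
    format: Optional[str] = None,
    sort_by: str = "name",
) -> list[AudioFile]:
    """Apply each filter only if it was provided, then sort by one key."""
    regex = re.compile(pattern) if pattern else None
    selected = [
        f
        for f in files
        if (regex is None or regex.search(f.name))
        and (min_duration_seconds is None or f.duration_seconds >= min_duration_seconds)
        and (min_modified_time is None or f.modified_time >= min_modified_time)
        and (format is None or f.format == format)
    ]
    sort_keys = {
        "name": lambda f: f.name,
        "size": lambda f: f.size_bytes,
        "duration": lambda f: f.duration_seconds,
        "modified_time": lambda f: f.modified_time,
        "format": lambda f: f.format,
    }
    return sorted(selected, key=sort_keys[sort_by])
```

Treating every filter as optional means a call with no arguments simply returns all files sorted by name.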

Audio Processing

  • convert_audio - Converts audio files to supported formats (mp3 or wav)
  • compress_audio - Compresses audio files that exceed size limits

Transcription

  • transcribe_audio - Advanced transcription using OpenAI's models:
    • Supports whisper-1, gpt-4o-transcribe, and gpt-4o-mini-transcribe
    • Custom prompts for guided transcription
    • Optional timestamp granularities for word and segment-level timing
    • JSON response format option
  • chat_with_audio - Interactive audio analysis using GPT-4o audio models:
    • Supports gpt-4o-audio-preview-2024-10-01, gpt-4o-audio-preview-2024-12-17, and gpt-4o-mini-audio-preview-2024-12-17
    • Custom system and user prompts
    • Provides conversational responses to audio content
  • transcribe_with_enhancement - Enhanced transcription with specialized templates:
    • detailed - Includes tone, emotion, and background details
    • storytelling - Transforms the transcript into a narrative form
    • professional - Creates formal, business-appropriate transcriptions
    • analytical - Adds analysis of speech patterns and key points
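
To make the transcribe_audio options above more concrete, here is a hedged sketch of how they could be validated and assembled into request arguments. The helper name `build_transcription_kwargs` and its default model are hypothetical. The sketch also assumes that timestamp granularities require the `verbose_json` response format; check the current OpenAI API reference before relying on that.

```python
from typing import Any, Optional

# Model names as listed for transcribe_audio in this README.
SUPPORTED_MODELS = {"whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe"}
VALID_GRANULARITIES = {"word", "segment"}


def build_transcription_kwargs(
    model: str = "gpt-4o-mini-transcribe",
    prompt: Optional[str] = None,
    timestamp_granularities: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Validate options and assemble keyword arguments for a transcription call."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model: {model}")
    kwargs: dict[str, Any] = {"model": model, "response_format": "json"}
    if prompt:
        kwargs["prompt"] = prompt
    if timestamp_granularities:
        unknown = set(timestamp_granularities) - VALID_GRANULARITIES
        if unknown:
            raise ValueError(f"Unknown granularities: {sorted(unknown)}")
        # Assumption: timestamps are only returned with verbose_json output.
        kwargs["response_format"] = "verbose_json"
        kwargs["timestamp_granularities"] = list(timestamp_granularities)
    return kwargs
```

The resulting dictionary could then be passed, together with an open audio file, to the OpenAI SDK's transcription call; that step is omitted here because it needs network access and an API key.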

Text-to-Speech

  • create_claudecast - Generate text-to-speech audio using OpenAI's TTS API:
    • Supports gpt-4o-mini-tts (preferred) and other speech models
    • Multiple voice options (alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer)
    • Speed adjustment and custom instructions
    • Customizable output file paths
    • Handles texts of any length by automatically splitting and joining audio segments
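
The splitting step mentioned in the last bullet might look roughly like the sketch below. It is an illustration only: `split_text` is a hypothetical helper, and the 4096-character default is an assumed per-request input limit that should be checked against the current OpenAI documentation. Joining the generated audio segments is not shown.

```python
import re


def split_text(text: str, max_chars: int = 4096) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries keeps each synthesized segment from starting or ending mid-sentence, which tends to make the joined audio sound more natural.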

Supported Audio Formats

| Model      | Supported Formats                                  |
|------------|----------------------------------------------------|
| Transcribe | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm    |
| Chat       | mp3, wav                                           |

Note: Files larger than 25MB are automatically compressed to meet API limits.
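
The format and size rules above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: `check_audio_file` is a hypothetical helper, the format sets are copied from the listing above, and "25MB" is interpreted as 25 × 1024 × 1024 bytes.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25MB" means 25 MiB
TRANSCRIBE_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
CHAT_FORMATS = {"mp3", "wav"}


def check_audio_file(filename: str, size_bytes: int) -> dict[str, bool]:
    """Report which model families accept the file and whether it is oversized."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return {
        "transcribe_supported": extension in TRANSCRIBE_FORMATS,
        "chat_supported": extension in CHAT_FORMATS,
        "needs_compression": size_bytes > MAX_UPLOAD_BYTES,
    }
```

For example, an `.m4a` file would be reported as supported for transcription but not for chat, so it would need conversion to mp3 or wav before chat_with_audio could use it.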

Example Usage with Claude

<details> <summary>Basic Audio Transcription</summary>

Claude, please transcribe my latest audio file with detailed insights.

Claude will automatically:

  1. Find the latest audio file using get_latest_audio
  2. Determine the appropriate transcription method
  3. Process the file with transcribe_with_enhancement using the "detailed" template
  4. Return the enhanced transcription </details>

<details> <summary>Advanced Audio File Search and Filtering</summary>

Claude, list all my audio files that are longer than 5 minutes and were modified after January 1st, 2024, sorted by size.

Claude will:

  1. Convert the date to a timestamp
  2. Use list_audio_files with appropriate filters:
    • min_duration_seconds: 300 (5 minutes)
    • min_modified_time: <timestamp for Jan 1, 2024>
    • sort_by: "size"
  3. Return a sorted list of matching audio files with comprehensive metadata </details>
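
The date conversion in step 1 can be sketched as follows. This assumes `min_modified_time` expects a Unix timestamp in seconds and that the date is interpreted in UTC; neither detail is confirmed by this README, and `to_unix_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_unix_timestamp(year: int, month: int, day: int) -> float:
    """Return midnight UTC of the given date as seconds since the epoch."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Filter arguments matching the example request above.
filters = {
    "min_duration_seconds": 300,  # 5 minutes
    "min_modified_time": to_unix_timestamp(2024, 1, 1),
    "sort_by": "size",
}
```

Making the time zone explicit avoids results that shift depending on the machine's local settings.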

<details> <summary>Batch Processing Multiple Files</summary>

Claude, find all MP3 files with "interview" in the filename and create professional transcripts for each one.

Claude will:

  1. Search for files using list_audio_files with:
    • pattern: ".*interview.*\\.mp3"
    • format: "mp3"
  2. Process all matching files in parallel using transcribe_with_enhancement
    • enhancement_type: "professional"
    • model: "gpt-4o-mini-transcribe" (for efficiency)
  3. Return all transcriptions in a well-formatted output </details>
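
Parallel processing of the matching files could be structured along these lines. This is a sketch of the general asyncio pattern rather than the server's code: `transcribe_stub` stands in for a real transcription call, and `transcribe_batch` and `max_concurrency` are hypothetical names.

```python
import asyncio


async def transcribe_stub(path: str) -> str:
    """Stand-in for a real transcription call; only simulates I/O latency."""
    await asyncio.sleep(0.01)
    return f"transcript of {path}"


async def transcribe_batch(paths: list[str], max_concurrency: int = 4) -> dict[str, str]:
    """Run transcriptions concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        async with semaphore:
            return await transcribe_stub(path)

    # gather preserves input order, so results line up with paths.
    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(zip(paths, results))
```

A call such as `asyncio.run(transcribe_batch(["interview_1.mp3", "interview_2.mp3"]))` returns a mapping from each path to its transcript. The semaphore keeps the number of in-flight requests bounded, which helps with API rate limits.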

<details> <summary>Generating Text-to-Speech with Claudecast</summary>

Claude, create a claudecast with this script: "Welcome to our podcast! Today we'll be discussing artificial intelligence trends in 2025." Use the shimmer voice.

Claude will:

  1. Use the create_claudecast tool with:
    • text_prompt containing the script
    • voice: "shimmer"
    • model: "gpt-4o-mini-tts" (default high-quality model)
    • instructions: "Speak in an enthusiastic, podcast host style" (optional)
    • speed: 1.0 (default, can be adjusted)
  2. Generate the audio file and save it to the configured audio directory
  3. Provide the path to the generated audio file </details>

Configuration with Claude Desktop

Add this to your claude_desktop_config.json:

UVX

{
  "mcpServers": {
    "whisper": {
      "command": "uvx",
      "args": [
        "--with",
        "aiofiles",
        "--with",
        "mcp[cli]",
        "--with",
        "openai",
        "--with",
        "pydub",
        "mcp-server-whisper"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key",
        "AUDIO_FILES_PATH": "/path/to/your/audio/files"
      }
    }
  }
}
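
If you already have other servers configured, the entry above has to be merged into the existing file rather than pasted over it. The following sketch shows one way to do that in memory; `add_whisper_server` is a hypothetical helper, and reading and writing the actual claude_desktop_config.json file is left out.

```python
import json
from typing import Any


def add_whisper_server(config: dict[str, Any], api_key: str, audio_path: str) -> dict[str, Any]:
    """Return a copy of config with the "whisper" entry added or replaced."""
    servers = dict(config.get("mcpServers", {}))
    servers["whisper"] = {
        "command": "uvx",
        "args": [
            "--with", "aiofiles",
            "--with", "mcp[cli]",
            "--with", "openai",
            "--with", "pydub",
            "mcp-server-whisper",
        ],
        "env": {"OPENAI_API_KEY": api_key, "AUDIO_FILES_PATH": audio_path},
    }
    return {**config, "mcpServers": servers}


# Example: an existing config with one unrelated (made-up) server entry.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
updated = add_whisper_server(existing, "your_openai_api_key", "/path/to/your/audio/files")
print(json.dumps(updated, indent=2))
```

The helper copies the server mapping before modifying it, so the original configuration object is left unchanged.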

Recommendation (macOS Only)

  • Install Screen Recorder By Omi (free)
  • Set AUDIO_FILES_PATH to /Users/<user>/Movies/Omi Screen Recorder and replace <user> with your username
  • As you record audio with the app, you can transcribe large batches directly with Claude

Development

This project uses modern Python development tools including uv, pytest, ruff, and mypy.

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=src

# Format code
uv run ruff format src

# Lint code
uv run ruff check src

# Run type checking (strict mode)
uv run mypy --strict src

# Run the pre-commit hooks
uv run pre-commit run --all-files

CI/CD Workflow

The project uses GitHub Actions for CI/CD:

  1. Lint & Type Check: Ensures code quality with ruff and strict mypy type checking
  2. Tests: Runs tests on multiple Python versions (3.10, 3.11)
  3. Build: Creates distribution packages
  4. Publish: Automatically publishes to PyPI when a new version tag is pushed

To create a new release version:

git checkout main
# Make sure everything is up to date
git pull
# Create a new version tag
git tag v0.1.1
# Push the tag
git push origin v0.1.1

How It Works

For detailed architecture information, see Architecture Documentation.

MCP Server Whisper is built on the Model Context Protocol, which standardizes how AI models interact with external tools and data sources. The server:

  1. Exposes Audio Processing Capabilities: Through standardized MCP tool interfaces
  2. Implements Parallel Processing: Using asyncio and batch operations for performance
  3. Manages File Operations: Handles detection, validation, conversion, and compression
  4. Provides Rich Transcription: Via different OpenAI models and enhancement templates
  5. Optimizes Performance: With caching mechanisms for repeated operations

Under the hood, it uses:

  • pydub for audio file manipulation
  • asyncio for concurrent processing
  • OpenAI's latest transcription models (including gpt-4o-transcribe)
  • OpenAI's GPT-4o audio models for enhanced understanding
  • OpenAI's gpt-4o-mini-tts for high-quality speech synthesis
  • FastMCP for simplified MCP server implementation
  • Type hints and strict mypy validation throughout the codebase
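
This README does not describe how the caching mentioned above is implemented. One plausible approach, shown here purely as an assumption, is to key cached metadata on the file path together with its modification time, so that an edited file is re-processed automatically. `MetadataCache` is a hypothetical class.

```python
from typing import Any, Callable


class MetadataCache:
    """Hypothetical cache keyed on path plus modification time."""

    def __init__(self, compute: Callable[[str], dict[str, Any]]) -> None:
        self._compute = compute
        self._entries: dict[tuple[str, float], dict[str, Any]] = {}

    def get(self, path: str, modified_time: float) -> dict[str, Any]:
        """Return cached metadata, computing it only on the first request."""
        key = (path, modified_time)
        if key not in self._entries:
            self._entries[key] = self._compute(path)
        return self._entries[key]
```

Because the modification time is part of the key, no explicit invalidation is needed when a file changes, although stale entries for old versions remain in memory in this simple form.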

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a new branch for your feature (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run the tests and linting (uv run pytest && uv run ruff check src && uv run mypy --strict src)
  5. Commit your changes (git commit -m 'Add some amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

<div align="center"> Made with ❤️ by <a href="https://github.com/arcaputo3">Richie Caputo</a> </div>
