Speech-to-Text MCP Server
An MCP server that provides speech-to-text transcription and speaker diarization using OpenAI Whisper and pyannote.audio.
A Model Context Protocol (MCP) server for speech-to-text transcription and speaker diarization.
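Features
- ✅ Accepts audio files up to 60 minutes long
- ✅ Automatic speech recognition (OpenAI Whisper)
- ✅ Speaker diarization (pyannote.audio)
- ✅ Supports multiple audio formats (mp3, wav, m4a, flac, and others)
- ✅ Produces timestamped transcripts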
System Requirements
- Python 3.10 or later
- Windows/Linux/macOS
- At least 8 GB RAM (16 GB recommended)
- A GPU is optional but recommended for faster processing
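Installation
1. Create a virtual environment (the activation command below is for Windows PowerShell)
python -m venv venv
.\venv\Scripts\Activate.ps1
2. Install dependencies
pip install -e .
3. Install FFmpeg (required for audio processing)
Windows:
- Download FFmpeg: https://ffmpeg.org/download.html
- Extract it and add it to the system PATH
Or use Chocolatey:
choco install ffmpeg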
4. Get a Hugging Face token
pyannote.audio requires a Hugging Face token:
- Go to https://huggingface.co/settings/tokens
- Create a new token
- Accept the terms of use for these models:
  - https://huggingface.co/pyannote/speaker-diarization-3.1
  - https://huggingface.co/pyannote/segmentation-3.0
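The prerequisites above can be checked before starting the server. The following is a minimal sketch, not part of this project: the `preflight` helper is hypothetical, it uses only the standard library, and it can only confirm that `HUGGINGFACE_TOKEN` is non-empty, not that the token is valid or that the model terms were accepted.

```python
import os
import shutil
import sys


def preflight(env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    # This README asks for Python 3.10 or later.
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or later is required")
    # FFmpeg is listed as required for audio processing, so it must be on PATH.
    if shutil.which("ffmpeg", path=env.get("PATH")) is None:
        problems.append("ffmpeg was not found on PATH")
    # Only presence is checked here; validity can only be confirmed online.
    if not env.get("HUGGINGFACE_TOKEN", "").strip():
        problems.append("HUGGINGFACE_TOKEN is not set")
    return problems
```

Calling `preflight()` with no arguments inspects the current environment; passing a dictionary makes the check easy to try with different settings.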
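Configuring the MCP Client
Add this server to your MCP client configuration file, replacing the path to server.py and the token with your own values: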
{
  "mcpServers": {
    "speech-to-text": {
      "command": "python",
      "args": [
        "e:\\demoProject\\speech_to_text\\server.py"
      ],
      "env": {
        "HUGGINGFACE_TOKEN": "your_token_here"
      }
    }
  }
}
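Windows paths need every backslash doubled inside JSON, which is easy to get wrong by hand. As an illustration only, the hypothetical helper below builds the same entry with Python's `json` module, which does the escaping for you. The function name and its `python_cmd` parameter are assumptions made for this sketch.

```python
import json


def build_client_config(server_script, hf_token, python_cmd="python"):
    """Return the "mcpServers" entry from this README as a JSON string."""
    config = {
        "mcpServers": {
            "speech-to-text": {
                "command": python_cmd,
                "args": [server_script],
                "env": {"HUGGINGFACE_TOKEN": hf_token},
            }
        }
    }
    # json.dumps writes each backslash in a Windows path as a doubled backslash.
    return json.dumps(config, indent=2)


# A raw string keeps the single backslashes of the real path.
print(build_client_config(r"e:\demoProject\speech_to_text\server.py", "your_token_here"))
```

The printed text can be pasted into the client configuration file, or merged into an existing "mcpServers" object if other servers are already configured.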
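Usage
The server provides the following tools:
1. transcribe_audio
Transcribes an audio file to text.
Parameters:
- audio_file_path (required): full path to the audio file
- language (optional): language code, such as "zh" (Chinese) or "en" (English); detected automatically by default
- enable_diarization (optional): whether to enable speaker diarization; defaults to false
Example call: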
{
  "audio_file_path": "C:\\Users\\username\\Desktop\\meeting.mp3",
  "language": "zh",
  "enable_diarization": true
}
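To show how the documented defaults interact, here is a small sketch that assembles the arguments object for transcribe_audio. The `build_transcribe_args` helper is hypothetical and exists only for illustration. It sends the optional keys only when they differ from the defaults described above.

```python
import json


def build_transcribe_args(audio_file_path, language=None, enable_diarization=False):
    """Assemble the arguments object for a transcribe_audio call."""
    if not audio_file_path:
        raise ValueError("audio_file_path is required")
    args = {"audio_file_path": audio_file_path}
    # Leaving out "language" keeps the documented default of automatic detection.
    if language is not None:
        args["language"] = language
    # false is the documented default, so the flag is sent only when enabling it.
    if enable_diarization:
        args["enable_diarization"] = True
    return args


args = build_transcribe_args(
    r"C:\Users\username\Desktop\meeting.mp3", language="zh", enable_diarization=True
)
print(json.dumps(args, indent=2))
```

With these inputs the printed JSON has the same keys and values as the example call.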
2. get_supported_formats
Returns the list of supported audio formats.
Sample Output
Without speaker diarization (the sample is a transcript of Chinese audio):
[00:00:00.000 --> 00:00:05.000] 大家好,欢迎参加今天的会议。
[00:00:05.000 --> 00:00:12.000] 今天我们主要讨论项目进展。
With speaker diarization (the "说话人" label in the output means "speaker"):
[说话人 SPEAKER_00] [00:00:00.000 --> 00:00:05.000]
大家好,欢迎参加今天的会议。
[说话人 SPEAKER_01] [00:00:05.000 --> 00:00:12.000]
今天我们主要讨论项目进展。
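Downstream tools often need the transcript as structured data instead of text. The sketch below parses both layouts into dictionaries. It assumes the output always follows the two samples above, which is an inference from those samples and not a documented guarantee. The names `parse_transcript` and `to_seconds` are hypothetical.

```python
import re

TS = r"\d{2}:\d{2}:\d{2}\.\d{3}"
# Optional "[说话人 NAME]" prefix, then "[start --> end]", then optional text.
HEADER = re.compile(
    rf"^(?:\[说话人 (?P<speaker>[^\]]+)\]\s*)?"
    rf"\[(?P<start>{TS}) --> (?P<end>{TS})\]\s*(?P<text>.*)$"
)


def to_seconds(stamp):
    """Convert HH:MM:SS.mmm to seconds as a float."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_transcript(output):
    """Return a list of segments with speaker, start, end, and text keys."""
    segments = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            segments.append({
                "speaker": match["speaker"],  # None when diarization is off
                "start": to_seconds(match["start"]),
                "end": to_seconds(match["end"]),
                "text": match["text"],
            })
        elif segments:
            # A line without a timestamp continues the previous segment,
            # which covers the diarized layout where text is on its own line.
            previous = segments[-1]
            previous["text"] = (previous["text"] + " " + line).strip()
    return segments
```

Each segment carries its start and end in seconds, so results can be sorted, filtered by speaker, or exported to a subtitle format.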
Technical Details
- Speech recognition engine: OpenAI Whisper (medium model)
- Speaker diarization: pyannote.audio 3.1
- Audio processing: pydub + FFmpeg
- MCP version: 1.0
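Performance Tips
- For long audio (>30 minutes), a GPU is recommended
- The first run downloads the model files (~1.5 GB)
- Processing takes roughly 0.5x to 2x the audio duration, depending on hardware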
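The 0.5x to 2x rule of thumb gives a quick bound on how long a job might take. This sketch only does the arithmetic from the figures in this README. The helper name is hypothetical, and actual times depend on hardware.

```python
MAX_AUDIO_MINUTES = 60  # upload limit stated in the feature list


def estimate_processing_minutes(audio_minutes):
    """Return (fastest, slowest) processing time from the 0.5x to 2x rule of thumb."""
    if not 0 < audio_minutes <= MAX_AUDIO_MINUTES:
        raise ValueError("audio must be longer than 0 and at most 60 minutes")
    return audio_minutes * 0.5, audio_minutes * 2


low, high = estimate_processing_minutes(45)
print(f"45 min of audio: roughly {low:g} to {high:g} min of processing")
```

For a 45-minute recording this gives a range of 22.5 to 90 minutes.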
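Troubleshooting
Problem: FFmpeg not found
Solution: Make sure FFmpeg is installed and on your PATH
Problem: CUDA out of memory
Solution: Use CPU mode or reduce the batch size
Problem: pyannote.audio authentication fails
Solution: Check that HUGGINGFACE_TOKEN is set correctly and that you have accepted the model terms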
License
MIT License