tts-audio-mcp
MCP server that analyzes TTS audio recordings, providing transcription, quality scoring, pacing analysis, and mispronunciation detection to debug voice call center audio.
MCP server that analyzes TTS audio recordings — enabling Claude Code, OpenCode, and Qwen Code to debug voice call center audio the same way they debug code errors.
Feed it an audio file, get back a structured report with transcription, quality scores, pacing analysis, and mispronunciation detection.
What It Does
Audio File (.wav/.mp3/.m4a)
            |
            v
+-------------------------------+
|     tts-audio-mcp server      |
|                               |
| transcribe    -> whisper.cpp  |
| quality_score -> librosa      |
| compare_tts   -> whisper+diff |
| analyze_tts   -> all combined |
|                               |
| Transport: stdio (MCP)        |
+-------------------------------+
            |
            v
Structured report -> LLM reasons about fixes
Tools
transcribe
Transcribe audio to text with word-level timestamps.
Input: audio_path (string), language (string, default: "en")
Output: Full transcription text, per-word timestamps (start_ms, end_ms), segments, detected language, duration.
quality_score
Analyze speech quality — pitch variation, energy dynamics, silence ratio.
Input: audio_path (string)
Output:
- Pitch: mean/std/range Hz, monotone risk flag, interpretation
- Energy: RMS level, dynamic range dB, interpretation
- Silence ratio: percentage of audio that is silent
- Overall assessment with list of detected issues
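To make the silence ratio and dynamic range figures more concrete, here is a minimal stdlib-only sketch of how such numbers can be derived from frame-level RMS energy. It is not the server's librosa code: the frame length, the -40 dB silence threshold, the choice to measure dynamic range over non-silent frames only, and the function names are all assumptions made for illustration.

```python
import math

def frame_rms(samples, frame_len=400):
    """Split samples (floats in [-1, 1]) into frames and return each frame's RMS."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames if f]

def to_db(rms, floor=1e-10):
    """Convert an RMS amplitude to decibels relative to full scale."""
    return 20.0 * math.log10(max(rms, floor))

def energy_summary(samples, frame_len=400, silence_db=-40.0):
    """Return silence ratio (%) and dynamic range (dB) from frame RMS levels.

    silence_db is an assumed threshold; the real server may use another value.
    """
    levels = [to_db(r) for r in frame_rms(samples, frame_len)]
    if not levels:
        return {"silence_ratio_pct": 0.0, "dynamic_range_db": 0.0}
    silent = sum(1 for db in levels if db < silence_db)
    voiced = [db for db in levels if db >= silence_db]
    # Assumption: dynamic range is measured over non-silent frames only.
    dynamic_range = max(voiced) - min(voiced) if voiced else 0.0
    return {
        "silence_ratio_pct": 100.0 * silent / len(levels),
        "dynamic_range_db": dynamic_range,
    }

# Synthetic signal: 0.5 s of a 220 Hz tone, then 0.5 s of silence, at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr // 2)]
signal = tone + [0.0] * (sr // 2)
summary = energy_summary(signal)
```

For this synthetic signal half of the frames are silent, so the silence ratio comes out at 50%, and the steady tone yields a dynamic range well under 1 dB.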
compare_tts
Compare TTS output against expected text to find mispronunciations.
Input: audio_path (string), expected_text (string), language (string, default: "en")
Output: Word Error Rate (WER), substitutions, insertions, deletions with positions.
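WER and the list of substitutions, insertions, and deletions can be computed with a word-level edit distance. The sketch below shows one common way to do that; it is an illustration, not the server's actual diff code. The normalization rules (lowercasing, dropping punctuation except apostrophes) and the names normalize and word_diff are assumptions.

```python
def normalize(text):
    """Lowercase and drop punctuation (except apostrophes) before comparing."""
    cleaned = "".join(ch if ch.isalnum() or ch in "' " else " " for ch in text.lower())
    return cleaned.split()

def word_diff(expected, actual):
    """Align expected vs. transcribed words; return (wer, edits).

    Each edit is (kind, position_in_expected, expected_word, actual_word).
    """
    ref, hyp = normalize(expected), normalize(actual)
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Walk back from the bottom-right corner to recover the individual edits.
    edits, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            edits.append(("substitution", i - 1, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("deletion", i - 1, ref[i - 1], None))
            i -= 1
        else:
            edits.append(("insertion", i, None, hyp[j - 1]))
            j -= 1
    edits.reverse()
    wer = len(edits) / len(ref) if ref else 0.0
    return wer, edits

wer, edits = word_diff(
    "Thank you for your patience. I'll transfer you to a specialist now.",
    "Thank you for your patience. I'll transfer you to a specialist's now.",
)
```

Here one of twelve expected words differs, so wer is 1/12 (about 8.3%) and edits holds a single substitution at position 10.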
analyze_tts
Full composite analysis — runs all of the above and returns a single structured report.
Input: audio_path (string), expected_text (string, optional), language (string, default: "en")
Output: Combined report with transcription, quality scores, pacing analysis (WPM, rushed words, long pauses), and pronunciation diff.
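Your coding assistant normally builds tool calls for you, but it can help to see what travels over the stdio transport. In MCP, a tool invocation is a JSON-RPC 2.0 request with the method tools/call, sent as a single newline-terminated line after the client has completed the initialize handshake. The sketch below only builds that request; the helper name build_tool_call and the request id are illustrative.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 'tools/call' request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, so the message stays on a single line.
    return json.dumps(message) + "\n"

line = build_tool_call(1, "analyze_tts", {
    "audio_path": "/tmp/tts-test-speech.wav",
    "expected_text": "Hello. Thank you for calling Acme Support. How can I help you today?",
    "language": "en",
})
```

The argument names (audio_path, expected_text, language) are the tool inputs listed above.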
Example Output
TTS Analysis Report
File: /tmp/tts-test-speech.wav
Duration: 4.15s | Words: 13 | Rate: 195 WPM
--- Transcription ---
Hello. Thank you for calling Acme Support. How can I help you today?
--- Quality Scores ---
Pitch: mean 258.5Hz, std 61.1Hz, range 264.3Hz
Good variation — expressive
Energy: RMS 0.0912, dynamic range 80dB
Wide dynamic range
Silence ratio: 37.3%
Overall: Minor issues detected (1)
! Very wide dynamic range — may clip
--- Pacing Analysis ---
Speaking rate: 195 WPM (natural: 120-180)
Minor pacing issues
Rushed words:
"How" spoken in 40ms
! Speaking rate too fast: 195 WPM (natural: 120-180)
! 1 rushed word(s) detected (<80ms)
--- Pronunciation Check ---
Expected: Hello. Thank you for calling Acme Support. How can I help you today?
Got: Hello. Thank you for calling Acme Support. How can I help you today?
WER: 0.0%
Perfect match — no mispronunciations detected
--- Issues Summary ---
1. Very wide dynamic range — may clip
2. Speaking rate too fast: 195 WPM (natural: 120-180)
3. 1 rushed word(s) detected (<80ms)
Prerequisites
- whisper.cpp with Metal acceleration: brew install whisper-cpp
- Whisper model: ggml-large-v3-turbo.bin (~1.5 GB) in models/
- Python 3.12 with librosa: .venv/bin/python3 with pip install librosa
- ffmpeg for audio format conversion: brew install ffmpeg
- Node.js 18+
Installation
git clone https://github.com/reactiongears/tts-audio-mcp.git
cd tts-audio-mcp
# Node dependencies
npm install
# Python venv for audio analysis
python3.12 -m venv .venv
.venv/bin/pip install librosa 'setuptools<82'
# Download whisper model
mkdir -p models
curl -L -o models/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
# Build
npm run build
Integration
Claude Code
Add to ~/.claude/.mcp.json:
{
  "mcpServers": {
    "tts-audio": {
      "command": "node",
      "args": ["/path/to/tts-audio-mcp/dist/index.js"]
    }
  }
}
OpenCode
Add to ~/.config/opencode/opencode.json under "mcp":
"tts-audio": {
  "type": "local",
  "command": ["node", "/path/to/tts-audio-mcp/dist/index.js"],
  "enabled": true
}
Qwen Code
Add to ~/.qwen/settings.json under "mcpServers":
"tts-audio": {
  "command": "node",
  "args": ["/path/to/tts-audio-mcp/dist/index.js"]
}
Environment Variables
| Variable | Default | Description |
|---|---|---|
| WHISPER_BINARY | whisper-cli | Path to whisper.cpp binary |
| WHISPER_MODEL_PATH | ~/Documents/_dev/tts-audio-mcp/models/ggml-large-v3-turbo.bin | Path to Whisper model file |
| TTS_PYTHON_BIN | .venv/bin/python3 | Python binary with librosa installed |
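If you want to check which values would be in effect before starting the server, the fallback logic can be sketched as follows. This is a Python illustration of the defaults in the table, not the server's own (Node) code. The helper name resolve_settings is hypothetical, and treating an empty variable as unset and expanding a leading ~ are assumptions.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "WHISPER_BINARY": "whisper-cli",
    "WHISPER_MODEL_PATH": "~/Documents/_dev/tts-audio-mcp/models/ggml-large-v3-turbo.bin",
    "TTS_PYTHON_BIN": ".venv/bin/python3",
}

def resolve_settings(env=None):
    """Return the effective settings: the environment value if set, else the default."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        value = env.get(name) or default             # assumption: empty string counts as unset
        settings[name] = os.path.expanduser(value)   # assumption: a leading ~ is expanded
    return settings

settings = resolve_settings({"WHISPER_BINARY": "/opt/homebrew/bin/whisper-cli"})
```

Calling resolve_settings() with no argument reads the real process environment.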
Usage
Once the MCP server is configured in your coding assistant, the tools are available automatically. You talk to your assistant in natural language — it decides when to call the tools and interprets the results for you.
Quick Start
Generate a test audio file to try it out:
# macOS — use the built-in TTS engine
say -o /tmp/test-greeting.wav --data-format=LEI16@16000 \
"Hello. Thank you for calling Acme Support. How can I help you today?"
Then in Claude Code, OpenCode, or Qwen Code:
> Analyze the audio at /tmp/test-greeting.wav
The assistant calls analyze_tts behind the scenes and returns a full report with transcription, quality scores, pacing, and issues.
Debugging TTS Problems
"It sounds robotic" — Check pitch variation and monotone risk:
> Run quality_score on /recordings/agent-greeting.wav — customers say it sounds robotic
If the report shows a pitch std below 20 Hz, it flags monotone risk. That tells you to increase prosody variation in your TTS config.
"Words are getting swallowed" — Compare against expected script:
> Compare /recordings/transfer-prompt.wav against the expected text:
> "Thank you for your patience. I'll transfer you to a specialist now."
The tool transcribes the audio, diffs it against your script, and reports substitutions ("specialist" → "specialist's"), deletions, and WER. You know exactly which words the TTS is mangling.
"It's talking too fast / has weird pauses" — Check pacing:
> Analyze /recordings/ivr-menu.wav — callers are complaining it's too fast
The report flags speaking rate (natural range: 120-180 WPM), individual rushed words (< 80ms), and unnatural pauses (> 500ms). You know where to add SSML breaks or adjust rate.
"Something is off but I'm not sure what" — Full analysis:
> Run a full analysis on /recordings/hold-message.wav
> The expected text is: "Your call is important to us. Please hold and an agent will be with you shortly."
Returns everything: transcription, quality metrics, pacing analysis, pronunciation diff, and a prioritized issues summary.
Batch Debugging
You can analyze multiple recordings in a conversation:
> Compare these three recordings against their scripts and tell me which one has the most issues:
> 1. /recordings/greeting.wav — "Welcome to Acme Support"
> 2. /recordings/hold.wav — "Please hold while I look that up"
> 3. /recordings/goodbye.wav — "Thank you for calling. Have a great day!"
The assistant calls compare_tts for each file and summarizes which recordings need attention.
Using Individual Tools
You can also ask for specific analysis:
| What you want | What to ask |
|---|---|
| Just the transcription | "Transcribe /path/to/audio.wav" |
| Just quality metrics | "Check the audio quality of /path/to/audio.wav" |
| Just pronunciation accuracy | "Compare /path/to/audio.wav against 'expected text here'" |
| Everything at once | "Full TTS analysis on /path/to/audio.wav" |
Supported Audio Formats
- .wav: processed directly (best performance)
- .mp3: auto-converted to WAV via ffmpeg
- .m4a: auto-converted to WAV via ffmpeg
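The conversion step can be pictured as a small helper that leaves WAV files alone and builds an ffmpeg command for everything else. The sketch below is an assumption about how such a step might look, not the server's code. The 16 kHz mono 16-bit PCM target is the input format whisper.cpp expects; the helper name and the /tmp output directory are hypothetical.

```python
from pathlib import PurePosixPath

CONVERT_EXTENSIONS = {".mp3", ".m4a"}

def conversion_command(audio_path, out_dir="/tmp"):
    """Return an ffmpeg argv for non-WAV input, or None if the file is already WAV."""
    src = PurePosixPath(audio_path)
    ext = src.suffix.lower()
    if ext == ".wav":
        return None  # processed directly, no conversion needed
    if ext not in CONVERT_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    dst = PurePosixPath(out_dir) / (src.stem + ".wav")
    # -ar 16000: 16 kHz sample rate, -ac 1: mono, pcm_s16le: 16-bit PCM WAV
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(dst)]

cmd = conversion_command("/recordings/ivr-menu.mp3")
```

To actually run the conversion, pass the list to subprocess.run(cmd, check=True).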
Interpreting Results
Quality Scores:
| Metric | Good | Concerning |
|---|---|---|
| Pitch std | 25-80 Hz (natural variation) | < 15 Hz (monotone/robotic) |
| Dynamic range | 10-50 dB | < 10 dB (flat) or > 70 dB (may clip) |
| Silence ratio | 10-50% | > 50% (too much dead air) or < 10% (no breathing room) |
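The "Concerning" column translates directly into a few threshold checks. Here is a minimal sketch, assuming the metrics arrive as plain numbers; the function name and message wording are illustrative and not the server's output format.

```python
def assess_quality(pitch_std_hz, dynamic_range_db, silence_ratio_pct):
    """Return the issues implied by the 'Concerning' thresholds in the table."""
    issues = []
    if pitch_std_hz < 15:
        issues.append("Pitch std below 15 Hz: monotone/robotic")
    if dynamic_range_db < 10:
        issues.append("Dynamic range below 10 dB: flat")
    elif dynamic_range_db > 70:
        issues.append("Dynamic range above 70 dB: may clip")
    if silence_ratio_pct > 50:
        issues.append("Silence ratio above 50%: too much dead air")
    elif silence_ratio_pct < 10:
        issues.append("Silence ratio below 10%: no breathing room")
    return issues

# Values taken from the example report earlier in this README.
issues = assess_quality(pitch_std_hz=61.1, dynamic_range_db=80, silence_ratio_pct=37.3)
```

With the example report's values, only the dynamic range check fires, which matches the single issue reported there. Values that fall between the "Good" and "Concerning" columns (for example a pitch std of 20 Hz) are not flagged by this sketch.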
Pacing:
| Metric | Natural range | Flag |
|---|---|---|
| Speaking rate | 120-180 WPM | Outside range |
| Word duration | > 80ms | < 80ms = rushed |
| Inter-word gap | < 500ms | > 500ms = unnatural pause |
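The pacing checks follow from the per-word timestamps that transcribe returns. The sketch below applies the thresholds from the table to a list of words. It is an illustration under assumptions: the word dictionaries use the start_ms and end_ms fields named earlier plus a hypothetical word key, and the speaking rate is measured from the first word's start to the last word's end, which may differ from how the server computes it.

```python
def analyze_pacing(words, min_word_ms=80, max_gap_ms=500, natural_wpm=(120, 180)):
    """Compute speaking rate, rushed words, and long pauses from word timestamps.

    words: list of dicts with 'word', 'start_ms', 'end_ms', in spoken order.
    """
    if not words:
        return {"wpm": 0.0, "rate_in_range": False, "rushed": [], "long_pauses": []}
    span_ms = words[-1]["end_ms"] - words[0]["start_ms"]
    wpm = len(words) * 60000 / span_ms if span_ms > 0 else 0.0
    # A word shorter than min_word_ms counts as rushed.
    rushed = [w["word"] for w in words if w["end_ms"] - w["start_ms"] < min_word_ms]
    # A gap longer than max_gap_ms between consecutive words counts as a long pause.
    long_pauses = [
        (prev["word"], nxt["word"], nxt["start_ms"] - prev["end_ms"])
        for prev, nxt in zip(words, words[1:])
        if nxt["start_ms"] - prev["end_ms"] > max_gap_ms
    ]
    return {
        "wpm": wpm,
        "rate_in_range": natural_wpm[0] <= wpm <= natural_wpm[1],
        "rushed": rushed,
        "long_pauses": long_pauses,
    }

# Made-up timestamps for illustration only.
sample = [
    {"word": "How", "start_ms": 0, "end_ms": 40},
    {"word": "can", "start_ms": 60, "end_ms": 260},
    {"word": "I", "start_ms": 900, "end_ms": 1000},
    {"word": "help", "start_ms": 1020, "end_ms": 1500},
]
pacing = analyze_pacing(sample)
```

For the made-up sample, "How" lasts 40 ms and is reported as rushed, and the 640 ms gap between "can" and "I" is reported as a long pause.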
Pronunciation (WER):
| WER | Interpretation |
|---|---|
| 0% | Perfect match |
| 1-5% | Minor issues (articles, contractions) |
| 5-15% | Noticeable mispronunciations |
| > 15% | Significant problems |
Real-World Workflow
A typical voice call center debugging session:
- Customer reports: "The bot sounds weird when it says the account number"
- You pull the call recording: /recordings/call-1234-segment.wav
- You know the expected script: "Your account number is 7 8 4 2 0 1 3"
- In Claude Code:
  > Compare /recordings/call-1234-segment.wav against "Your account number is 7 8 4 2 0 1 3"
  > What's wrong and how should I fix the TTS config?
- Claude calls compare_tts, sees the TTS is running digits together ("seven eight four" → "seventy-eight four"), and suggests adding SSML <say-as interpret-as="digits"> tags or inter-digit pauses to your TTS configuration
The LLM doesn't just report the numbers — it reasons about the root cause and suggests specific fixes to your TTS code or configuration.
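As an illustration of the kind of fix suggested in this workflow, the sketch below wraps runs of digits in SSML say-as tags before the text is sent to a TTS engine. Treat it as a starting point under assumptions: the helper name is hypothetical, and whether interpret-as="digits" is honored depends on your TTS provider.

```python
import re
from xml.sax.saxutils import escape

# A run of digits, optionally separated by single or multiple spaces.
DIGIT_RUN = re.compile(r"\d(?:[ \d]*\d)?")

def wrap_digits_ssml(text):
    """Wrap each digit run in <say-as interpret-as="digits"> inside a <speak> root."""
    def repl(match):
        digits = match.group(0).replace(" ", "")
        return f'<say-as interpret-as="digits">{digits}</say-as>'
    # Escape &, <, > first so the script text cannot break the SSML markup.
    return "<speak>" + DIGIT_RUN.sub(repl, escape(text)) + "</speak>"

ssml = wrap_digits_ssml("Your account number is 7 8 4 2 0 1 3")
```

After regenerating the audio from the SSML, run compare_tts against the same expected text to confirm the digits are now read one at a time.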
License
MIT