tts-audio-mcp

MCP server that analyzes TTS audio recordings — enabling Claude Code, OpenCode, and Qwen Code to debug voice call center audio the same way they debug code errors.

Feed it an audio file, get back a structured report with transcription, quality scores, pacing analysis, and mispronunciation detection.

What It Does

Audio File (.wav/.mp3/.m4a)
       |
       v
+-------------------------------+
|     tts-audio-mcp server      |
|                               |
|  transcribe   -> whisper.cpp  |
|  quality_score -> librosa     |
|  compare_tts  -> whisper+diff |
|  analyze_tts  -> all combined |
|                               |
|  Transport: stdio (MCP)       |
+-------------------------------+
       |
       v
  Structured report -> LLM reasons about fixes

Tools

`transcribe`

Transcribe audio to text with word-level timestamps.

Input: audio_path (string), language (string, default: "en")

Output: Full transcription text, per-word timestamps (start_ms, end_ms), segments, detected language, duration.

`quality_score`

Analyze speech quality — pitch variation, energy dynamics, silence ratio.

Input: audio_path (string)

Output:

Pitch: mean/std/range Hz, monotone risk flag, interpretation
Energy: RMS level, dynamic range dB, interpretation
Silence ratio: percentage of audio that is silent
Overall assessment with list of detected issues

`compare_tts`

Compare TTS output against expected text to find mispronunciations.

Input: audio_path (string), expected_text (string), language (string, default: "en")

Output: Word Error Rate (WER), substitutions, insertions, deletions with positions.
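A report like this can be produced by a word-level edit-distance alignment between the expected text and the transcription. The following is a minimal sketch of that idea, not the server's actual implementation: the function names, the normalization (lowercasing, dropping punctuation), and the output shape are all assumptions made for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and keep letters, digits, and apostrophes (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_diff(expected: str, got: str) -> dict:
    """Align two word sequences with Levenshtein distance and list the edits."""
    ref, hyp = normalize(expected), normalize(got)
    n, m = len(ref), len(hyp)
    # dist[i][j] = number of edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (expected word missing)
                dist[i][j - 1] + 1,         # insertion (extra word in audio)
            )
    # Walk back from the bottom-right corner to recover the individual edits.
    subs, dels, ins = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(differs):
            if differs:
                subs.append((i - 1, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels.append((i - 1, ref[i - 1]))
            i -= 1
        else:
            ins.append((j - 1, hyp[j - 1]))
            j -= 1
    return {
        "wer": dist[n][m] / n if n else 0.0,
        "substitutions": subs[::-1],  # (index in expected, expected, got)
        "deletions": dels[::-1],      # (index in expected, expected word)
        "insertions": ins[::-1],      # (index in transcription, extra word)
    }
```

In this sketch, WER is the edit count divided by the number of expected words, so one wrong word in a 12-word script gives roughly 8.3%.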

`analyze_tts`

Full composite analysis — runs all of the above and returns a single structured report.

Input: audio_path (string), expected_text (string, optional), language (string, default: "en")

Output: Combined report with transcription, quality scores, pacing analysis (WPM, rushed words, long pauses), and pronunciation diff.
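If you want to call a tool from your own client rather than through an assistant, MCP uses JSON-RPC 2.0, and the stdio transport sends one JSON message per line. The sketch below only builds a `tools/call` request for `analyze_tts`; the helper name is hypothetical, the argument names are taken from the Input line above, and a real client must complete the `initialize` handshake before sending it.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a single line of JSON-RPC 2.0."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the result stays on one
    # line, which is what newline-delimited stdio framing requires.
    return json.dumps(message)


line = tool_call_request(
    2,
    "analyze_tts",
    {
        "audio_path": "/tmp/tts-test-speech.wav",
        "expected_text": "Hello. Thank you for calling Acme Support. How can I help you today?",
        "language": "en",
    },
)
print(line)  # write this line, plus "\n", to the server's stdin
```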

Example Output

TTS Analysis Report
File: /tmp/tts-test-speech.wav
Duration: 4.15s | Words: 13 | Rate: 195 WPM

--- Transcription ---
Hello. Thank you for calling Acme Support. How can I help you today?

--- Quality Scores ---
Pitch: mean 258.5Hz, std 61.1Hz, range 264.3Hz
  Good variation — expressive
Energy: RMS 0.0912, dynamic range 80dB
  Wide dynamic range
Silence ratio: 37.3%
Overall: Minor issues detected (1)
  ! Very wide dynamic range — may clip

--- Pacing Analysis ---
Speaking rate: 195 WPM (natural: 120-180)
Minor pacing issues
Rushed words:
  "How" spoken in 40ms
  ! Speaking rate too fast: 195 WPM (natural: 120-180)
  ! 1 rushed word(s) detected (<80ms)

--- Pronunciation Check ---
Expected: Hello. Thank you for calling Acme Support. How can I help you today?
Got:      Hello. Thank you for calling Acme Support. How can I help you today?
WER: 0.0%
Perfect match — no mispronunciations detected

--- Issues Summary ---
1. Very wide dynamic range — may clip
2. Speaking rate too fast: 195 WPM (natural: 120-180)
3. 1 rushed word(s) detected (<80ms)

Prerequisites

whisper.cpp with Metal acceleration: brew install whisper-cpp
Whisper model: ggml-large-v3-turbo.bin (~1.5 GB) in models/
Python 3.12 with librosa, installed in a virtual environment at .venv (the server runs .venv/bin/python3 by default)
ffmpeg for audio format conversion: brew install ffmpeg
Node.js 18+

Installation

git clone https://github.com/reactiongears/tts-audio-mcp.git
cd tts-audio-mcp

# Node dependencies
npm install

# Python venv for audio analysis
python3.12 -m venv .venv
.venv/bin/pip install librosa 'setuptools<82'

# Download whisper model
mkdir -p models
curl -L -o models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Build
npm run build

Integration

Claude Code

Add to ~/.claude/.mcp.json:

{
  "mcpServers": {
    "tts-audio": {
      "command": "node",
      "args": ["/path/to/tts-audio-mcp/dist/index.js"]
    }
  }
}

OpenCode

Add to ~/.config/opencode/opencode.json under "mcp":

"tts-audio": {
  "type": "local",
  "command": ["node", "/path/to/tts-audio-mcp/dist/index.js"],
  "enabled": true
}

Qwen Code

Add to ~/.qwen/settings.json under "mcpServers":

"tts-audio": {
  "command": "node",
  "args": ["/path/to/tts-audio-mcp/dist/index.js"]
}

Environment Variables

Variable	Default	Description
`WHISPER_BINARY`	`whisper-cli`	Path to whisper.cpp binary
`WHISPER_MODEL_PATH`	`~/Documents/_dev/tts-audio-mcp/models/ggml-large-v3-turbo.bin`	Path to Whisper model file
`TTS_PYTHON_BIN`	`.venv/bin/python3`	Python binary with librosa installed
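Before wiring the server into an assistant, it can help to check that these three settings point at real files. The following is a rough pre-flight sketch, not part of the project: the function names are hypothetical, and whether the server itself expands `~` in the model path is an assumption.

```python
import os
import shutil

# Defaults copied from the table above.
DEFAULTS = {
    "WHISPER_BINARY": "whisper-cli",
    "WHISPER_MODEL_PATH": "~/Documents/_dev/tts-audio-mcp/models/ggml-large-v3-turbo.bin",
    "TTS_PYTHON_BIN": ".venv/bin/python3",
}


def resolve_config(env=None) -> dict[str, str]:
    """Return each setting from the environment, falling back to its default."""
    env = os.environ if env is None else env
    config = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    config["WHISPER_MODEL_PATH"] = os.path.expanduser(config["WHISPER_MODEL_PATH"])
    return config


def preflight(config: dict[str, str]) -> list[str]:
    """List human-readable problems; an empty list means everything was found."""
    problems = []
    # shutil.which accepts both bare command names (searched on PATH)
    # and explicit paths (checked directly).
    if shutil.which(config["WHISPER_BINARY"]) is None:
        problems.append(f"whisper.cpp binary not found: {config['WHISPER_BINARY']}")
    if not os.path.isfile(config["WHISPER_MODEL_PATH"]):
        problems.append(f"Whisper model not found: {config['WHISPER_MODEL_PATH']}")
    if shutil.which(config["TTS_PYTHON_BIN"]) is None:
        problems.append(f"Python binary not found: {config['TTS_PYTHON_BIN']}")
    return problems


if __name__ == "__main__":
    for problem in preflight(resolve_config()):
        print("!", problem)
```

Run it from the repository root so that the relative `.venv/bin/python3` default resolves correctly.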

Usage

Once the MCP server is configured in your coding assistant, the tools are available automatically. You talk to your assistant in natural language — it decides when to call the tools and interprets the results for you.

Quick Start

Generate a test audio file to try it out:

# macOS — use the built-in TTS engine
say -o /tmp/test-greeting.wav --data-format=LEI16@16000 \
  "Hello. Thank you for calling Acme Support. How can I help you today?"

Then in Claude Code, OpenCode, or Qwen Code:

> Analyze the audio at /tmp/test-greeting.wav

The assistant calls analyze_tts behind the scenes and returns a full report with transcription, quality scores, pacing, and issues.

Debugging TTS Problems

"It sounds robotic" — Check pitch variation and monotone risk:

> Run quality_score on /recordings/agent-greeting.wav — customers say it sounds robotic

A low pitch std (roughly below 15-20 Hz) sets the monotone risk flag in the report. That tells you to increase prosody variation in your TTS config.

"Words are getting swallowed" — Compare against expected script:

> Compare /recordings/transfer-prompt.wav against the expected text:
> "Thank you for your patience. I'll transfer you to a specialist now."

The tool transcribes the audio, diffs it against your script, and reports substitutions ("specialist" → "specialist's"), deletions, and WER. You know exactly which words the TTS is mangling.

"It's talking too fast / has weird pauses" — Check pacing:

> Analyze /recordings/ivr-menu.wav — callers are complaining it's too fast

The report flags speaking rate (natural range: 120-180 WPM), individual rushed words (< 80ms), and unnatural pauses (> 500ms). You know where to add SSML breaks or adjust rate.
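The pacing checks can be derived from word-level timestamps alone. Here is a minimal sketch using the thresholds quoted above; the `Word` type and `analyze_pacing` function are hypothetical, and measuring the rate from the first word's start to the last word's end is an assumption about how the server computes WPM.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def analyze_pacing(words, min_wpm=120, max_wpm=180, rushed_ms=80, pause_ms=500) -> dict:
    """Compute speaking rate and flag rushed words and long inter-word gaps."""
    if not words:
        return {"wpm": 0.0, "rushed": [], "long_pauses": [], "issues": []}
    # Assumed definition: rate over the spoken span, ignoring leading/trailing silence.
    span_ms = words[-1].end_ms - words[0].start_ms
    wpm = len(words) * 60_000 / span_ms if span_ms > 0 else 0.0
    rushed = [
        (w.text, w.end_ms - w.start_ms)
        for w in words
        if w.end_ms - w.start_ms < rushed_ms
    ]
    long_pauses = [
        (a.text, b.text, b.start_ms - a.end_ms)
        for a, b in zip(words, words[1:])
        if b.start_ms - a.end_ms > pause_ms
    ]
    issues = []
    if wpm > max_wpm:
        issues.append(f"Speaking rate too fast: {wpm:.0f} WPM (natural: {min_wpm}-{max_wpm})")
    elif wpm < min_wpm:
        issues.append(f"Speaking rate too slow: {wpm:.0f} WPM (natural: {min_wpm}-{max_wpm})")
    if rushed:
        issues.append(f"{len(rushed)} rushed word(s) detected (<{rushed_ms}ms)")
    if long_pauses:
        issues.append(f"{len(long_pauses)} long pause(s) detected (>{pause_ms}ms)")
    return {"wpm": wpm, "rushed": rushed, "long_pauses": long_pauses, "issues": issues}


sample = [Word("How", 0, 40), Word("can", 40, 300), Word("I", 1000, 1200), Word("help", 1200, 1600)]
report = analyze_pacing(sample)
```

For the sample above, "How" lasts 40 ms and the gap before "I" is 700 ms, so both a rushed word and a long pause are reported even though the overall rate is within the natural range.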

"Something is off but I'm not sure what" — Full analysis:

> Run a full analysis on /recordings/hold-message.wav
> The expected text is: "Your call is important to us. Please hold and an agent will be with you shortly."

Returns everything: transcription, quality metrics, pacing analysis, pronunciation diff, and a prioritized issues summary.

Batch Debugging

You can analyze multiple recordings in a single conversation:

> Compare these three recordings against their scripts and tell me which one has the most issues:
> 1. /recordings/greeting.wav — "Welcome to Acme Support"
> 2. /recordings/hold.wav — "Please hold while I look that up"
> 3. /recordings/goodbye.wav — "Thank you for calling. Have a great day!"

The assistant calls compare_tts for each file and summarizes which recordings need attention.

Using Individual Tools

You can also ask for specific analysis:

What you want	What to ask
Just the transcription	"Transcribe /path/to/audio.wav"
Just quality metrics	"Check the audio quality of /path/to/audio.wav"
Just pronunciation accuracy	"Compare /path/to/audio.wav against 'expected text here'"
Everything at once	"Full TTS analysis on /path/to/audio.wav"

Supported Audio Formats

.wav — processed directly (best performance)
.mp3 — auto-converted to WAV via ffmpeg
.m4a — auto-converted to WAV via ffmpeg
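The conversion step can be pictured as a single ffmpeg call that produces 16 kHz mono 16-bit PCM, the WAV layout whisper.cpp has traditionally expected and the same one the Quick Start `say` command writes. This is an illustrative sketch, not the server's code; the helper names and the exact flags the server passes are assumptions.

```python
from pathlib import Path

SUPPORTED = {".wav", ".mp3", ".m4a"}


def needs_conversion(audio_path: str) -> bool:
    """True for supported non-WAV inputs; raises on unsupported extensions."""
    suffix = Path(audio_path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported audio format: {suffix or '(none)'}")
    return suffix != ".wav"


def ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build the argv for converting src to a 16 kHz mono 16-bit WAV at dst."""
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", src,            # input file
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]
```

Pass the returned list to `subprocess.run(cmd, check=True)` to perform the conversion; building the command as a list avoids shell quoting problems with paths that contain spaces.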

Interpreting Results

Quality Scores:

Metric	Good	Concerning
Pitch std	25-80 Hz (natural variation)	< 15 Hz (monotone/robotic)
Dynamic range	10-50 dB	< 10 dB (flat) or > 70 dB (may clip)
Silence ratio	10-50%	> 50% (too much dead air) or < 10% (no breathing room)
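If you post-process reports in your own scripts, the "Concerning" column translates into a few threshold checks. The function below is a hypothetical helper written from the table, not the server's logic; values that fall between the "Good" and "Concerning" ranges are deliberately left unflagged.

```python
def quality_issues(pitch_std_hz: float, dynamic_range_db: float, silence_ratio: float) -> list[str]:
    """Return the concerns from the table; silence_ratio is a fraction (0.373 = 37.3%)."""
    issues = []
    if pitch_std_hz < 15:
        issues.append("Low pitch variation: monotone risk")
    if dynamic_range_db < 10:
        issues.append("Flat dynamic range")
    elif dynamic_range_db > 70:
        issues.append("Very wide dynamic range: may clip")
    if silence_ratio > 0.50:
        issues.append("Too much dead air")
    elif silence_ratio < 0.10:
        issues.append("No breathing room")
    return issues


# Values from the Example Output section: pitch std 61.1 Hz, 80 dB, 37.3% silence.
example = quality_issues(61.1, 80, 0.373)
```

With the Example Output values, only the dynamic range check fires, which matches the single quality issue listed in that report.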

Pacing:

Metric	Natural range	Flag
Speaking rate	120-180 WPM	Outside range
Word duration	> 80ms	< 80ms = rushed
Inter-word gap	< 500ms	> 500ms = unnatural pause

Pronunciation (WER):

WER	Interpretation
0%	Perfect match
1-5%	Minor issues (articles, contractions)
5-15%	Noticeable mispronunciations
> 15%	Significant problems

Real-World Workflow

A typical voice call center debugging session:

Customer reports: "The bot sounds weird when it says the account number"
You pull the call recording: /recordings/call-1234-segment.wav
You know the expected script: "Your account number is 7 8 4 2 0 1 3"

In Claude Code:

> Compare /recordings/call-1234-segment.wav against "Your account number is 7 8 4 2 0 1 3"
> What's wrong and how should I fix the TTS config?

Claude calls compare_tts, sees the TTS is running digits together ("seven eight four" → "seventy-eight four"), and suggests adding SSML <say-as interpret-as="digits"> tags or inter-digit pauses to your TTS configuration.

The LLM doesn't just report the numbers — it reasons about the root cause and suggests specific fixes to your TTS code or configuration.
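A fix along those lines might look like the sketch below, which wraps every run of digits in the say-as tag mentioned above before the text is sent to the TTS engine. The `wrap_digits` helper is hypothetical, and support for `interpret-as="digits"` varies between TTS vendors, so check your engine's SSML documentation before relying on it.

```python
import re
from xml.sax.saxutils import escape

# A digit, optionally followed by more digits separated by single spaces.
DIGIT_RUN = re.compile(r"\d(?: ?\d)*")


def wrap_digits(text: str) -> str:
    """Return SSML in which each digit run is marked to be read digit by digit."""
    parts = []
    last = 0
    for match in DIGIT_RUN.finditer(text):
        # Escape the plain text between digit runs so &, <, > stay valid XML.
        parts.append(escape(text[last:match.start()]))
        digits = match.group().replace(" ", "")
        parts.append(f'<say-as interpret-as="digits">{digits}</say-as>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


ssml = wrap_digits("Your account number is 7 8 4 2 0 1 3")
```

After regenerating the audio from the SSML, rerun compare_tts against the same expected script to confirm the digits are now read individually.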

License

MIT
