Kokoro MCP Server
Provides text-to-speech generation using the Kokoro-82M model, enabling AI assistants to generate voiceovers and audio content directly within Claude Desktop and Cursor.
Kokoro MCP SERVER: Text To Speech (TTS)
A comprehensive Text-to-Speech toolkit built on Kokoro-82M with audio enhancement, Model Context Protocol (MCP) server integration, CLI interface, and Docker deployment.
Features
- Kokoro-82M TTS Engine: Open-weight model with 82M parameters (510 tokens per pass)
- Streamlit Web UI: Enterprise-grade management interface with real-time preview (OPTIONAL)
- Audio Enhancement: Professional processing with librosa (normalization, noise reduction, fade in/out)
- MCP Server: Model Context Protocol integration for Claude Desktop, Cursor, and other AI tools (OPTIONAL)
- CLI Interface: Command-line tools for quick generation (OPTIONAL)
- Batch Processing: Generate multiple audio files efficiently
- Script Processing: Convert complete video scripts with automatic text chunking
- Docker Support: Containerization with docker-compose
- Enterprise Features: Structured logging, configuration management, comprehensive testing
- CI/CD: GitHub Actions pipeline with automated testing
Streamlit Web Interface (Optional)
Beautiful web UI for managing all TTS functionality:
- Single Generation - Convert text with real-time preview
- Batch Processing - Process multiple texts in one go
- Script Processing - Complete video script conversion
- Voice Explorer - Compare all 12 voices side-by-side
- Configuration - Manage settings visually
- Analytics - Track generations with charts and statistics
Install with: pip install -e ".[streamlit]" or pip install -e ".[complete]"
Quick Start:
python run_streamlit.py
# Opens at http://localhost:8501
See STREAMLIT_README.md for complete Streamlit documentation.
Building on Kokoro-82M
What Kokoro-82M Provides Out-of-the-Box: Kokoro-82M is an exceptional open-weight TTS model that delivers: core neural TTS inference with 82M parameters, a basic Python inference library (KPipeline), 10 professional voice packs (male/female, American/British), phonemization (G2P) system, and raw 24kHz audio output with a 510-token processing limit per pass.
What aparsoft-tts Adds: We integrate Kokoro-82M's excellent TTS inference with comprehensive development tooling and workflow enhancements. This toolkit adds:
- Audio post-processing - Normalization, noise reduction, silence trimming, and fade in/out using librosa
- Automated script workflows - Direct script-to-voiceover conversion with paragraph detection and gap management
- IDE-native generation - MCP server integration eliminates context switching for Claude Desktop and Cursor users
- Deployment infrastructure - Docker deployment, structured logging, configuration management, and comprehensive testing
- Batch processing - CLI and Python APIs for processing multiple segments efficiently
Technical Implementation
Audio Enhancement (librosa Integration):
This toolkit adds an audio processing pipeline on top of Kokoro-generated TTS output:
# Without enhancement - raw Kokoro output
raw_audio = kokoro_pipeline(text)
# With enhancement
audio = enhance_audio(
    raw_audio,
    normalize=True,        # Consistent volume
    trim_silence=True,     # Remove dead air
    noise_reduction=True,  # Spectral gating
    add_fade=True          # Smooth transitions
)
Result: Voiceovers ready for YouTube, podcasts, or content creation without additional audio editing.
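To make the normalization and fade steps concrete, here is a minimal, dependency-free sketch. It is not the toolkit's implementation (which uses librosa); the function names `normalize_peak` and `apply_fades` and the `target_peak` parameter are assumptions chosen for illustration.

```python
from typing import List


def normalize_peak(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def apply_fades(
    samples: List[float], sample_rate: int = 24000, fade_duration: float = 0.1
) -> List[float]:
    """Apply a linear fade-in and fade-out of fade_duration seconds."""
    out = list(samples)
    # Never let the two fades overlap on very short clips.
    n = min(int(fade_duration * sample_rate), len(out) // 2)
    for i in range(n):
        ramp = i / n  # 0.0 at the clip edge, approaching 1.0 inward
        out[i] *= ramp
        out[-1 - i] *= ramp
    return out
```

Starting and ending the clip at zero amplitude is what prevents audible clicks when segments are joined.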
MCP Server Integration:
Traditional workflow:
# 1. Write script in Claude/Cursor
# 2. Copy text to terminal
# 3. Run Python script
# 4. Switch back to editor
# 5. Repeat for each segment
With MCP server:
# In Claude Desktop or Cursor:
"Generate voiceover for this section using am_michael voice"
# Done. Audio generated without leaving your workspace.
Workflow Enhancement:
- Content creators: Write scripts in AI editors, generate voiceovers inline
- Developers: Generate test audio during development without context switching
- Teams: Standardized TTS across tools (Claude, Cursor, CLI, API)
- Automation: AI agents can generate audio as part of content pipelines
Deployment Features:
The toolkit wraps Kokoro with common deployment and development needs:
- Configuration management - Environment-based settings, no hardcoded values
- Structured logging - JSON logs for aggregation, correlation IDs for tracing
- Error handling - Custom exceptions, graceful failures, detailed error context
- Testing - Comprehensive test suite, CI/CD integration
- Docker deployment - Containerized with health checks, resource limits
- CLI interface - Quick access without writing code
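As an illustration of JSON logs with correlation IDs, the sketch below uses only the standard `logging` module. The toolkit itself lists structlog as its logging dependency, so treat `JsonFormatter` and the field names as assumptions rather than the project's actual log schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"correlation_id": ...})
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("tts-demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # One ID per request lets you trace all log lines for that request.
    request_id = uuid.uuid4().hex
    logger.info("generation started", extra={"correlation_id": request_id})
    logger.info("generation finished", extra={"correlation_id": request_id})
```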
Use Cases
YouTube/Podcast Production:
# Process entire video script with proper gaps
engine.process_script("script.txt", "voiceover.wav", gap_duration=0.5)
AI-Assisted Content Creation:
# In Claude Desktop with MCP:
User: "Generate a 30-second intro for my coding tutorial"
Claude: *generates script and voiceover via MCP*
Batch Content Generation:
# Generate 100 audio segments for e-learning course
engine.batch_generate(lesson_texts, output_dir="lessons/")
Development/Testing:
# Quick CLI test during development
aparsoft-tts generate "Test message" -o test.wav
Quick Start
Installation
System Dependencies (Required):
# Ubuntu/Debian
sudo apt-get install espeak-ng ffmpeg libsndfile1
# macOS
brew install espeak ffmpeg
# Windows: Download from
# - espeak-ng: https://github.com/espeak-ng/espeak-ng/releases
# - ffmpeg: https://ffmpeg.org/download.html
Python Package - Choose Your Installation:
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# OPTION 1: Complete installation (RECOMMENDED)
# Includes: TTS Engine + MCP Server + CLI + Streamlit Web UI
pip install -e ".[complete]"
# OPTION 2: Without Streamlit (Developers)
# Includes: TTS Engine + MCP Server + CLI (no web UI)
pip install -e ".[mcp,cli]"
# OPTION 3: Streamlit Only
# Includes: TTS Engine + Streamlit Web UI (no MCP, no CLI)
pip install -e ".[streamlit]"
# OPTION 4: Core Only (Minimal)
# Includes: TTS Engine only (Python API)
pip install -e .
# OPTION 5: Everything (Contributors)
# Includes: All features + development tools
pip install -e ".[all]"
See INSTALLATION.md for detailed installation options and troubleshooting.
Quick Launch
Streamlit Web UI:
# Cross-platform launcher
python run_streamlit.py
# Or use platform-specific scripts
./run_streamlit.sh # Linux/macOS
run_streamlit.bat # Windows
# Or direct
streamlit run streamlit_app.py
MCP Server (for Claude Desktop/Cursor):
- See MCP Integration section below
Basic Usage
from aparsoft_tts import TTSEngine
# Initialize engine
engine = TTSEngine()
# Generate speech
engine.generate(
    text="Welcome to Kokoro YouTube TTS",
    output_path="output.wav"
)
CLI Usage
# Generate audio
aparsoft-tts generate "Hello world" -o output.wav
# List available voices
aparsoft-tts voices
# Process video script
aparsoft-tts script video_script.txt -o voiceover.wav
# Batch generate
aparsoft-tts batch "Intro" "Body" "Outro" -d segments/
Available Voices
Male Voices:
- am_adam - American male (natural inflection)
- am_michael - American male (deeper tones, professional)
- bm_george - British male (classic accent)
- bm_lewis - British male (modern accent)
Female Voices:
- af_bella - American female (warm tones)
- af_nicole - American female (dynamic range)
- af_sarah - American female (clear articulation)
- af_sky - American female (youthful energy)
- bf_emma - British female (professional)
- bf_isabella - British female (soft tones)
Special Voices:
- af - Default mix (50-50 blend of Bella and Sarah)
Advanced Usage
Custom Configuration
from aparsoft_tts import TTSEngine, TTSConfig
# Create custom configuration
config = TTSConfig(
    voice="bm_george",
    speed=1.2,
    enhance_audio=True,
    fade_duration=0.2
)
engine = TTSEngine(config=config)
engine.generate("Custom configuration", "output.wav")
Audio Enhancement
from aparsoft_tts.utils.audio import enhance_audio
# Generate raw audio
audio = engine.generate("Test audio")
# Apply custom enhancement
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,
    trim_silence=True,
    trim_db=25.0,
    noise_reduction=True,
    add_fade=True,
    fade_duration=0.15
)
Batch Processing
# Process multiple texts
texts = [
    "Welcome to the tutorial",
    "Let's explore the features",
    "Thanks for watching"
]
paths = engine.batch_generate(
    texts=texts,
    output_dir="segments/",
    voice="am_michael"
)
Script Processing
# Process complete video script with automatic text chunking
engine.process_script(
    script_path="video_script.txt",
    output_path="complete_voiceover.wav",
    gap_duration=0.5,  # Gap between paragraphs
    voice="am_michael",
    speed=1.0
)
# Note: Kokoro processes up to 510 tokens per pass.
# Long scripts are automatically chunked and combined seamlessly.
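The chunking idea can be sketched as greedy sentence packing under a fixed budget. This is only an illustration: the function name `chunk_text` is hypothetical, and whitespace-separated words stand in for Kokoro's real (phoneme-level) tokens, so the toolkit's actual chunker will draw its boundaries differently.

```python
import re
from typing import List


def chunk_text(text: str, max_tokens: int = 510) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        words = sentence.split()
        # A single over-long sentence is split hard on word boundaries.
        while len(words) > max_tokens:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        # Start a new chunk if this sentence would overflow the budget.
        if current and count + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting at sentence boundaries keeps prosody natural, because each pass through the model then sees complete sentences rather than fragments.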
Podcast Generation (Multi-Voice)
Create podcast-style content with different voices and speeds per segment. Perfect for interviews, dialogues, or multi-speaker content.
Via MCP (Claude Desktop/Cursor):
"Create a podcast with these segments:
- Intro by am_michael: 'Welcome to Tech Talk'
- Guest by af_bella at 0.95 speed: 'Thanks for having me'
- Outro by am_michael: 'See you next week'"
Via Python API:
from aparsoft_tts import TTSEngine
from aparsoft_tts.utils.audio import combine_audio_segments, save_audio
engine = TTSEngine()
# Define podcast segments with different voices/speeds
segments = [
    {"text": "Welcome to the show", "voice": "am_michael", "speed": 1.0},
    {"text": "Great to be here", "voice": "af_bella", "speed": 0.95},
    {"text": "Thanks for listening", "voice": "am_michael", "speed": 1.0},
]
# Generate each segment
audio_segments = []
for seg in segments:
    audio = engine.generate(
        text=seg["text"],
        voice=seg["voice"],
        speed=seg["speed"]
    )
    audio_segments.append(audio)
# Combine with gaps
combined = combine_audio_segments(
    audio_segments,
    sample_rate=24000,
    gap_duration=0.6  # Pause between segments
)
# Save
save_audio(combined, "podcast.wav", sample_rate=24000)
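For intuition about what a gap between segments means at the sample level, here is a small sketch. It is an assumption about the general technique, not the body of `combine_audio_segments`; the name `combine_with_gaps` is hypothetical, and plain lists stand in for audio arrays.

```python
from typing import List, Sequence


def combine_with_gaps(
    segments: Sequence[Sequence[float]],
    sample_rate: int = 24000,
    gap_duration: float = 0.5,
) -> List[float]:
    """Concatenate segments, inserting gap_duration seconds of silence between them."""
    # A gap is just a run of zero-valued samples: seconds * samples per second.
    gap = [0.0] * int(round(gap_duration * sample_rate))
    combined: List[float] = []
    for index, segment in enumerate(segments):
        if index > 0:  # no gap before the first segment
            combined.extend(gap)
        combined.extend(segment)
    return combined
```

With three segments and a 0.6 s gap at 24 kHz, the two gaps add 2 x 14,400 = 28,800 samples, or 1.2 s, to the total duration.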
Via Streamlit UI:
- Open Streamlit: python run_streamlit.py
- Navigate to the "Podcast Generation" tab
- Click "Add Segment" for each speaker
- Configure voice, speed, and text per segment
- Adjust gap duration in settings panel
- Click "Generate Podcast"
Features:
- Per-segment voice control (host/guest conversations)
- Individual speed settings (emphasis/pacing)
- Configurable gaps between segments
- Audio enhancement (normalization, crossfades)
- Segment reordering (move up/down)
- Template support for quick start
Streaming Generation
# Generate audio in chunks
for chunk in engine.generate_stream(
    text="Long text for streaming...",
    voice="am_michael"
):
    # Process chunk as it's generated
    process_audio_chunk(chunk)
Model Context Protocol (MCP) Integration
Quick MCP Setup (5 Minutes)
What is MCP? Model Context Protocol lets Claude Desktop and Cursor generate speech directly from your conversations. No copy-pasting, no context switching.
For Developers: Quick Start
# 1. Find your Python path
which python # Linux/Mac
where python # Windows
# Example output: /home/ram/projects/youtube-creator/venv/bin/python
Claude Desktop:
# 1. Open config (creates if doesn't exist)
code ~/Library/Application\ Support/Claude/claude_desktop_config.json # macOS
code ~/.config/Claude/claude_desktop_config.json # Linux
notepad %APPDATA%\Claude\claude_desktop_config.json # Windows
# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}
# 3. Restart Claude (Cmd/Ctrl + R)
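If the config file already lists other MCP servers, pasting the snippet over it would discard them. The sketch below merges the entry instead; `add_mcp_server` is a hypothetical helper written for this README, not part of the toolkit, and it assumes the file (if present) contains valid JSON.

```python
import json
from pathlib import Path


def add_mcp_server(
    config_path: Path, python_path: str, name: str = "aparsoft-tts"
) -> dict:
    """Add or update one entry under "mcpServers", preserving all other keys."""
    config: dict = {}
    if config_path.exists():
        raw = config_path.read_text(encoding="utf-8")
        if raw.strip():
            config = json.loads(raw)
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "command": python_path,  # must be an absolute path to the venv's Python
        "args": ["-m", "aparsoft_tts.mcp_server"],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it with the config path for your platform from step 1 and the interpreter path printed by `which python` or `where python`.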
Cursor:
# 1. Create/edit config
mkdir -p ~/.cursor && code ~/.cursor/mcp.json
# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}
# 3. Restart Cursor completely
Testing MCP Server
# Quick test - should print server info
python -m aparsoft_tts.mcp_server --help
# Interactive testing with MCP Inspector
npx @modelcontextprotocol/inspector \
--command "/path/to/venv/bin/python" \
--args "-m" "aparsoft_tts.mcp_server"
# Opens UI at http://localhost:6274
Usage Examples
In Claude Desktop or Cursor, just ask naturally:
# Basic generation
"Generate speech for 'Welcome to my channel' using am_michael voice"
# Voice discovery
"List all available TTS voices"
# Batch processing
"Create voiceovers for these three segments: 'Intro', 'Main', 'Outro'"
# Script processing
"Process video_script.txt and create a complete voiceover"
# Custom parameters
"Generate 'Test message' at 1.3x speed with British accent"
MCP Tools Available
- generate_speech: Single audio generation with full control
  - Text input (up to 10,000 characters)
  - Voice selection (6 voices)
  - Speed control (0.5x - 2.0x)
  - Audio enhancement toggle
- list_voices: Get voice catalog with descriptions
- batch_generate: Process multiple texts efficiently
- process_script: Complete video script conversion
  - Automatic paragraph detection
  - Configurable gap duration
  - Handles long texts via automatic chunking
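The documented limits for generate_speech (text up to 10,000 characters, speed between 0.5x and 2.0x) could be enforced before any audio is generated, roughly as follows. The function name `validate_request` and the error messages are assumptions for illustration; the server's own validation may differ.

```python
MAX_TEXT_CHARS = 10_000
MIN_SPEED, MAX_SPEED = 0.5, 2.0


def validate_request(text: str, speed: float = 1.0) -> None:
    """Raise ValueError if the request falls outside the documented limits."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(
            f"speed {speed} is outside the range {MIN_SPEED}-{MAX_SPEED}"
        )
```

Rejecting bad input up front gives the calling assistant a clear error message instead of a failure deep inside model inference.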
Troubleshooting MCP
"Could not attach to MCP server"
- Use absolute path: /full/path/to/venv/bin/python
- Test server runs: python -m aparsoft_tts.mcp_server
- Check Python version: python --version (needs 3.10+)
"Tool not found"
# Reinstall MCP dependencies
pip install -e ".[mcp]"
# Verify FastMCP
python -c "from fastmcp import FastMCP; print('OK')"
Detailed Documentation: See TUTORIAL.md for comprehensive MCP guide with advanced features, debugging, and production deployment.
Docker Deployment
Build and Run
# Build image
docker build -t aparsoft-tts:latest .
# Run MCP server
docker run -d \
--name aparsoft-tts \
-v $(pwd)/outputs:/app/outputs \
-v $(pwd)/logs:/app/logs \
aparsoft-tts:latest
# Run CLI commands
docker run --rm \
-v $(pwd)/outputs:/app/outputs \
aparsoft-tts:latest \
aparsoft-tts generate "Docker test" -o /app/outputs/test.wav
Docker Compose
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Environment Variables
# TTS Configuration
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
# MCP Server
MCP_SERVER_NAME=aparsoft-tts-server
MCP_ENABLE_RATE_LIMITING=true
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json
Project Structure
youtube-creator/
├── aparsoft_tts/
│   ├── core/
│   │   └── engine.py           # TTS engine
│   ├── utils/
│   │   ├── audio.py            # Audio processing with librosa
│   │   ├── logging.py          # Structured logging
│   │   └── exceptions.py       # Custom exceptions
│   ├── config.py               # Configuration management
│   ├── cli.py                  # CLI interface
│   └── mcp_server.py           # MCP server (FastMCP)
├── tests/
│   ├── unit/                   # Unit tests
│   └── integration/            # Integration tests
├── examples/                   # Usage examples
├── pyproject.toml              # Project metadata
├── Dockerfile                  # Docker configuration
└── docker-compose.yml          # Docker Compose config
Audio Processing
The toolkit enhances Kokoro's output with professional audio processing:
Features:
- Normalization: Consistent volume levels
- Silence Trimming: Remove quiet sections (configurable threshold)
- Noise Reduction: Spectral gating for cleaner audio
- Fade In/Out: Smooth transitions, prevents clicks
- Custom Processing: Extensible with librosa/scipy
Enhancement Pipeline:
from aparsoft_tts.utils.audio import enhance_audio, save_audio
# Generate raw audio
audio = engine.generate("Your text here")
# Apply enhancement pipeline
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,        # Normalize volume
    trim_silence=True,     # Trim silence
    trim_db=20.0,          # Threshold in dB
    noise_reduction=True,  # Apply noise gate
    add_fade=True,         # Add fade in/out
    fade_duration=0.1      # 100ms fade
)
# Save enhanced audio
save_audio(enhanced, "enhanced.wav", sample_rate=24000)
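To show how a dB threshold such as trim_db=20.0 translates into an amplitude cutoff, here is a simplified, sample-wise sketch. It is an assumption for illustration only: librosa's trimming works on frame-wise energy rather than individual samples, and `trim_silence` here is a hypothetical stand-alone function.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], trim_db: float = 20.0) -> List[float]:
    """Drop leading and trailing samples quieter than trim_db below the peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return []  # nothing but silence
    # Convert decibels below peak to a linear amplitude threshold:
    # 20 dB below peak corresponds to one tenth of the peak amplitude.
    threshold = peak * 10 ** (-trim_db / 20.0)
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    return list(samples[loud[0] : loud[-1] + 1])
```

A larger trim_db lowers the cutoff, so it keeps more of the quiet audio at the edges of the clip.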
Configuration
Using Configuration Files
from aparsoft_tts import TTSConfig, MCPConfig, LoggingConfig, Config
# TTS settings
tts_config = TTSConfig(
    voice="am_michael",
    speed=1.0,
    enhance_audio=True,
    sample_rate=24000,
    output_format="wav"
)
# MCP server settings
mcp_config = MCPConfig(
    server_name="aparsoft-tts-production",
    enable_rate_limiting=True,
    rate_limit_calls=100
)
# Logging settings
logging_config = LoggingConfig(
    level="INFO",
    format="json",
    output="file"
)
# Combined configuration
config = Config(
    tts=tts_config,
    mcp=mcp_config,
    logging=logging_config
)
Environment Variables
Create a .env file:
# TTS Settings
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
TTS_SAMPLE_RATE=24000
# Audio Processing
TTS_TRIM_SILENCE=true
TTS_TRIM_DB=20.0
TTS_FADE_DURATION=0.1
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=console
LOG_OUTPUT=stdout
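The sketch below shows one way variables like these could be mapped onto typed settings. The toolkit lists pydantic for configuration management, so this stdlib-only version is an assumption for illustration; `EnvTTSSettings` and `load_settings` are hypothetical names, and the defaults simply mirror the values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _parse_bool(value: str) -> bool:
    """Treat common truthy spellings as True, anything else as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EnvTTSSettings:
    voice: str = "am_michael"
    speed: float = 1.0
    enhance_audio: bool = True
    sample_rate: int = 24000
    trim_db: float = 20.0
    fade_duration: float = 0.1


def load_settings(env: Optional[Mapping[str, str]] = None) -> EnvTTSSettings:
    """Build settings from TTS_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = EnvTTSSettings()
    return EnvTTSSettings(
        voice=env.get("TTS_VOICE", d.voice),
        speed=float(env.get("TTS_SPEED", d.speed)),
        enhance_audio=(
            _parse_bool(env["TTS_ENHANCE_AUDIO"])
            if "TTS_ENHANCE_AUDIO" in env
            else d.enhance_audio
        ),
        sample_rate=int(env.get("TTS_SAMPLE_RATE", d.sample_rate)),
        trim_db=float(env.get("TTS_TRIM_DB", d.trim_db)),
        fade_duration=float(env.get("TTS_FADE_DURATION", d.fade_duration)),
    )
```

Passing a plain dict instead of reading `os.environ` directly makes the loader easy to test without touching the real environment.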
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=aparsoft_tts --cov-report=html
# Run specific test file
pytest tests/unit/test_engine.py
# Run only fast tests
pytest -m "not slow"
Development
Setup Development Environment
# Clone repository
git clone https://github.com/aparsoft/kokoro-youtube-tts.git
cd kokoro-youtube-tts
# Install with dev dependencies
pip install -e ".[dev,mcp,cli,all]"
# Install pre-commit hooks
pre-commit install
Running CI Locally
The project includes a GitHub Actions workflow for CI/CD:
- Code quality checks (Black, Ruff, mypy)
- Tests on multiple Python versions (3.10, 3.11, 3.12)
- Docker build verification
- Security scanning with Trivy
API Reference
TTSEngine
Initialization:
TTSEngine(config: TTSConfig | None = None)
Methods:
- generate(text, output_path, voice, speed, enhance) - Generate speech
- generate_stream(text, voice, speed) - Stream audio chunks
- batch_generate(texts, output_dir, voice, speed) - Batch processing
- process_script(script_path, output_path, gap_duration, voice, speed) - Process scripts
- list_voices() - Get available voices
Configuration Classes
- TTSConfig - TTS engine settings
- MCPConfig - MCP server configuration
- LoggingConfig - Logging configuration
- Config - Main application configuration
Audio Utilities
- enhance_audio(audio, ...) - Apply audio enhancement
- combine_audio_segments(segments, ...) - Combine audio segments
- save_audio(audio, path, ...) - Save audio to file
- load_audio(path, ...) - Load audio from file
- chunk_audio(audio, ...) - Split audio into chunks
- get_audio_duration(audio, ...) - Get audio duration
Examples
See the examples/ directory for complete examples:
- basic_usage.py - Simple generation examples
- youtube_workflow.py - Complete YouTube video production workflow
Troubleshooting
espeak-ng not found
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak
# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases
Audio quality issues
Enable audio enhancement:
engine.generate(text="Your text", enhance=True)
Import errors
Ensure virtual environment is activated:
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
Docker issues
Check container logs:
docker logs aparsoft-tts
Performance
Benchmarks (on typical consumer hardware):
- Model Loading: ~2-3 seconds (one-time)
- Generation Speed: ~0.5s per second of audio
- Memory Usage: ~2GB RAM (model loaded)
- Token Processing: Up to 510 tokens per pass
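These figures lend themselves to a quick back-of-the-envelope estimate. The helper below is only arithmetic over the approximate numbers quoted above (about 0.5 s of compute per second of audio, 2-3 s model load); `estimate_generation_seconds` is a hypothetical name, and real timings will vary with hardware.

```python
def estimate_generation_seconds(
    audio_seconds: float,
    seconds_per_audio_second: float = 0.5,
    load_seconds: float = 2.5,
    model_loaded: bool = False,
) -> float:
    """Rough wall-clock estimate: optional one-time load plus proportional compute."""
    compute = audio_seconds * seconds_per_audio_second
    return compute if model_loaded else compute + load_seconds
```

For a 60-second voiceover this gives roughly 30 s with a warm engine and about 32.5 s on a cold start, which is why reusing one engine instance matters most for many short clips.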
Text Length Limits:
Kokoro-82M processes up to 510 tokens in a single pass. For longer texts:
- Automatic chunking: Engine automatically splits long texts
- Script processing: Handles unlimited length via intelligent segmentation
- Batch processing: Each segment processed independently
Optimization Tips:
- Reuse engine instances (avoid reloading model)
- Disable enhancement for draft generations (enhance=False)
- Use streaming for long texts (automatic chunking)
- Batch process multiple files for efficiency
- Enable GPU acceleration on supported platforms
- For very long texts, use process_script() for optimal chunking
Credits & Acknowledgements
This project builds upon excellent open-source software:
Core Dependencies
- Kokoro-82M by hexgrad - Apache License 2.0
  - Open-weight TTS model with 82M parameters
  - Processes up to 510 tokens per pass
  - Architecture by @yl4579 (StyleTTS 2)
  - 24kHz audio output, <100 hours training data
- librosa - ISC License
  - Audio analysis and processing
- FastMCP - MIT License
  - Model Context Protocol server framework
Additional Dependencies
- soundfile - Audio I/O
- pydantic - Configuration management
- structlog - Structured logging
- typer - CLI framework
- pytest - Testing framework
Special Thanks
- @yl4579 for StyleTTS 2 architecture
- hexgrad team for Kokoro model and inference library
- Anthropic for Model Context Protocol
- All contributors to the open-source dependencies
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Third-Party Licenses:
- Kokoro-82M: Apache License 2.0
- librosa: ISC License
- FastMCP: MIT License
Support
- Email: contact@aparsoft.com
- Website: https://aparsoft.com
- Issues: GitHub Issues
Citation
If you use this toolkit in your research or project, please cite:
@software{kokoro_youtube_tts,
author = {Aparsoft},
title = {Kokoro YouTube TTS: Comprehensive TTS Toolkit},
year = {2025},
url = {https://github.com/aparsoft/kokoro-youtube-tts}
}
For the Kokoro model:
@software{kokoro_tts,
author = {hexgrad},
title = {Kokoro-82M: Open-weight TTS Model},
year = {2024},
url = {https://huggingface.co/hexgrad/Kokoro-82M}
}
Built with ❤️ for the video creator community
