MCP Servers

Kokoro MCP Server

Provides text-to-speech generation using the Kokoro-82M model, enabling AI assistants to generate voiceovers and audio content directly within Claude Desktop and Cursor.

README

Kokoro MCP SERVER: Text To Speech (TTS)

A comprehensive Text-to-Speech toolkit built on Kokoro-82M with audio enhancement, Model Context Protocol (MCP) server integration, CLI interface, and Docker deployment.

📺 Demo Video

Features

Kokoro-82M TTS Engine: Open-weight model with 82M parameters (510 tokens per pass)
🌐 Streamlit Web UI: Enterprise-grade management interface with real-time preview (OPTIONAL)
Audio Enhancement: Professional processing with librosa (normalization, noise reduction, fade in/out)
MCP Server: Model Context Protocol integration for Claude Desktop, Cursor, and other AI tools (OPTIONAL)
CLI Interface: Command-line tools for quick generation (OPTIONAL)
Batch Processing: Generate multiple audio files efficiently
Script Processing: Convert complete video scripts with automatic text chunking
Docker Support: Containerization with docker-compose
Enterprise Features: Structured logging, configuration management, comprehensive testing
CI/CD: GitHub Actions pipeline with automated testing

Streamlit Web Interface (Optional)

Beautiful web UI for managing all TTS functionality:

🎯 Single Generation - Convert text with real-time preview
📦 Batch Processing - Process multiple texts in one go
📄 Script Processing - Complete video script conversion
🔍 Voice Explorer - Compare all 12 voices side-by-side
⚙️ Configuration - Manage settings visually
📊 Analytics - Track generations with charts and statistics

Install with: pip install -e ".[streamlit]" or pip install -e ".[complete]"

Quick Start:

python run_streamlit.py
# Opens at http://localhost:8501

📚 See STREAMLIT_README.md for complete Streamlit documentation.

Building on Kokoro-82M

What Kokoro-82M Provides Out-of-the-Box: Kokoro-82M is an exceptional open-weight TTS model that delivers: core neural TTS inference with 82M parameters, a basic Python inference library (KPipeline), 10 professional voice packs (male/female, American/British), phonemization (G2P) system, and raw 24kHz audio output with a 510-token processing limit per pass.

What aparsoft-tts Adds: We integrate Kokoro-82M's excellent TTS inference with comprehensive development tooling and workflow enhancements. This toolkit adds:

Audio post-processing - Normalization, noise reduction, silence trimming, and fade in/out using librosa
Automated script workflows - Direct script-to-voiceover conversion with paragraph detection and gap management
IDE-native generation - MCP server integration eliminates context switching for Claude Desktop and Cursor users
Deployment infrastructure - Docker deployment, structured logging, configuration management, and comprehensive testing
Batch processing - CLI and Python APIs for processing multiple segments efficiently

Technical Implementation

Audio Enhancement (librosa Integration):

This toolkit adds an audio processing pipeline on Kokoro generated TTS output:

# Without enhancement - raw Kokoro output
audio = kokoro_pipeline(text)

# With enhancement
audio = enhance_audio(
    kokoro_output,
    normalize=True,        # Consistent volume
    trim_silence=True,     # Remove dead air
    noise_reduction=True,  # Spectral gating
    add_fade=True         # Smooth transitions
)

Result: Voiceovers ready for YouTube, podcasts, or content creation without additional audio editing.

MCP Server Integration:

Traditional workflow:

# 1. Write script in Claude/Cursor
# 2. Copy text to terminal
# 3. Run Python script
# 4. Switch back to editor
# 5. Repeat for each segment

With MCP server:

# In Claude Desktop or Cursor:
"Generate voiceover for this section using am_michael voice"
# Done. Audio generated without leaving your workspace.

Workflow Enhancement:

Content creators: Write scripts in AI editors, generate voiceovers inline
Developers: Generate test audio during development without context switching
Teams: Standardized TTS across tools (Claude, Cursor, CLI, API)
Automation: AI agents can generate audio as part of content pipelines

Deployment Features:

The toolkit wraps Kokoro with common deployment and development needs:

Configuration management - Environment-based settings, no hardcoded values
Structured logging - JSON logs for aggregation, correlation IDs for tracing
Error handling - Custom exceptions, graceful failures, detailed error context
Testing - Comprehensive test suite, CI/CD integration
Docker deployment - Containerized with health checks, resource limits
CLI interface - Quick access without writing code

Use Cases

YouTube/Podcast Production:

# Process entire video script with proper gaps
engine.process_script("script.txt", "voiceover.wav", gap_duration=0.5)

AI-Assisted Content Creation:

# In Claude Desktop with MCP:
User: "Generate a 30-second intro for my coding tutorial"
Claude: *generates script and voiceover via MCP*

Batch Content Generation:

# Generate 100 audio segments for e-learning course
engine.batch_generate(lesson_texts, output_dir="lessons/")

Development/Testing:

# Quick CLI test during development
aparsoft-tts generate "Test message" -o test.wav

Quick Start

Installation

System Dependencies (Required):

# Ubuntu/Debian
sudo apt-get install espeak-ng ffmpeg libsndfile1

# macOS
brew install espeak ffmpeg

# Windows: Download from
# - espeak-ng: http://espeak.sourceforge.net/
# - ffmpeg: https://ffmpeg.org/download.html

Python Package - Choose Your Installation:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# OPTION 1: Complete installation (RECOMMENDED)
# Includes: TTS Engine + MCP Server + CLI + Streamlit Web UI
pip install -e ".[complete]"

# OPTION 2: Without Streamlit (Developers)
# Includes: TTS Engine + MCP Server + CLI (no web UI)
pip install -e ".[mcp,cli]"

# OPTION 3: Streamlit Only
# Includes: TTS Engine + Streamlit Web UI (no MCP, no CLI)
pip install -e ".[streamlit]"

# OPTION 4: Core Only (Minimal)
# Includes: TTS Engine only (Python API)
pip install -e .

# OPTION 5: Everything (Contributors)
# Includes: All features + development tools
pip install -e ".[all]"

📚 See INSTALLATION.md for detailed installation options and troubleshooting.

Quick Launch

Streamlit Web UI:

# Cross-platform launcher
python run_streamlit.py

# Or use platform-specific scripts
./run_streamlit.sh      # Linux/macOS
run_streamlit.bat        # Windows

# Or direct
streamlit run streamlit_app.py

MCP Server (for Claude Desktop/Cursor):

See MCP Integration section below

Basic Usage

from aparsoft_tts import TTSEngine

# Initialize engine
engine = TTSEngine()

# Generate speech
engine.generate(
    text="Welcome to Kokoro YouTube TTS",
    output_path="output.wav"
)

CLI Usage

# Generate audio
aparsoft-tts generate "Hello world" -o output.wav

# List available voices
aparsoft-tts voices

# Process video script
aparsoft-tts script video_script.txt -o voiceover.wav

# Batch generate
aparsoft-tts batch "Intro" "Body" "Outro" -d segments/

Available Voices

Male Voices:

am_adam - American male (natural inflection)
am_michael - American male (deeper tones, professional)
bm_george - British male (classic accent)
bm_lewis - British male (modern accent)

Female Voices:

af_bella - American female (warm tones)
af_nicole - American female (dynamic range)
af_sarah - American female (clear articulation)
af_sky - American female (youthful energy)
bf_emma - British female (professional)
bf_isabella - British female (soft tones)

Special Voices:

af - Default mix (50-50 blend of Bella and Sarah)

Advanced Usage

Custom Configuration

from aparsoft_tts import TTSEngine, TTSConfig

# Create custom configuration
config = TTSConfig(
    voice="bm_george",
    speed=1.2,
    enhance_audio=True,
    fade_duration=0.2
)

engine = TTSEngine(config=config)
engine.generate("Custom configuration", "output.wav")

Audio Enhancement

from aparsoft_tts.utils.audio import enhance_audio

# Generate raw audio
audio = engine.generate("Test audio")

# Apply custom enhancement
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,
    trim_silence=True,
    trim_db=25.0,
    noise_reduction=True,
    add_fade=True,
    fade_duration=0.15
)

Batch Processing

# Process multiple texts
texts = [
    "Welcome to the tutorial",
    "Let's explore the features",
    "Thanks for watching"
]

paths = engine.batch_generate(
    texts=texts,
    output_dir="segments/",
    voice="am_michael"
)

Script Processing

# Process complete video script with automatic text chunking
engine.process_script(
    script_path="video_script.txt",
    output_path="complete_voiceover.wav",
    gap_duration=0.5,  # Gap between paragraphs
    voice="am_michael",
    speed=1.0
)

# Note: Kokoro processes up to 510 tokens per pass.
# Long scripts are automatically chunked and combined seamlessly.

Podcast Generation (Multi-Voice)

Create podcast-style content with different voices and speeds per segment. Perfect for interviews, dialogues, or multi-speaker content.

Via MCP (Claude Desktop/Cursor):

"Create a podcast with these segments:
- Intro by am_michael: 'Welcome to Tech Talk'
- Guest by af_bella at 0.95 speed: 'Thanks for having me'
- Outro by am_michael: 'See you next week'"

Via Python API:

from aparsoft_tts.utils.audio import combine_audio_segments, save_audio

# Define podcast segments with different voices/speeds
segments = [
    {"text": "Welcome to the show", "voice": "am_michael", "speed": 1.0},
    {"text": "Great to be here", "voice": "af_bella", "speed": 0.95},
    {"text": "Thanks for listening", "voice": "am_michael", "speed": 1.0},
]

# Generate each segment
audio_segments = []
for seg in segments:
    audio = engine.generate(
        text=seg["text"],
        voice=seg["voice"],
        speed=seg["speed"]
    )
    audio_segments.append(audio)

# Combine with gaps
combined = combine_audio_segments(
    audio_segments,
    sample_rate=24000,
    gap_duration=0.6  # Pause between segments
)

# Save
save_audio(combined, "podcast.wav", sample_rate=24000)

Via Streamlit UI:

Open Streamlit: python run_streamlit.py
Navigate to "🎙️ Podcast Generation" tab
Click "➕ Add Segment" for each speaker
Configure voice, speed, and text per segment
Adjust gap duration in settings panel
Click "🎧 Generate Podcast"

Features:

Per-segment voice control (host/guest conversations)
Individual speed settings (emphasis/pacing)
Configurable gaps between segments
Audio enhancement (normalization, crossfades)
Segment reordering (move up/down)
Template support for quick start

Streaming Generation

# Generate audio in chunks
for chunk in engine.generate_stream(
    text="Long text for streaming...",
    voice="am_michael"
):
    # Process chunk as it's generated
    process_audio_chunk(chunk)

Model Context Protocol (MCP) Integration

Quick MCP Setup (5 Minutes)

What is MCP? Model Context Protocol lets Claude Desktop and Cursor generate speech directly from your conversations. No copy-pasting, no context switching.

For Developers: Quick Start

# 1. Find your Python path
which python  # Linux/Mac
where python  # Windows

# Example output: /home/ram/projects/youtube-creator/venv/bin/python

Claude Desktop:

# 1. Open config (creates if doesn't exist)
code ~/Library/Application\ Support/Claude/claude_desktop_config.json  # macOS
code ~/.config/Claude/claude_desktop_config.json  # Linux
notepad %APPDATA%\Claude\claude_desktop_config.json  # Windows

# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}

# 3. Restart Claude (Cmd/Ctrl + R)

Cursor:

# 1. Create/edit config
mkdir -p ~/.cursor && code ~/.cursor/mcp.json

# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}

# 3. Restart Cursor completely

Testing MCP Server

# Quick test - should print server info
python -m aparsoft_tts.mcp_server --help

# Interactive testing with MCP Inspector
npx @modelcontextprotocol/inspector \
  --command "/path/to/venv/bin/python" \
  --args "-m" "aparsoft_tts.mcp_server"
# Opens UI at http://localhost:6274

Usage Examples

In Claude Desktop or Cursor, just ask naturally:

# Basic generation
"Generate speech for 'Welcome to my channel' using am_michael voice"

# Voice discovery
"List all available TTS voices"

# Batch processing
"Create voiceovers for these three segments: 'Intro', 'Main', 'Outro'"

# Script processing
"Process video_script.txt and create a complete voiceover"

# Custom parameters
"Generate 'Test message' at 1.3x speed with British accent"

MCP Tools Available

generate_speech: Single audio generation with full control
- Text input (up to 10,000 characters)
- Voice selection (6 voices)
- Speed control (0.5x - 2.0x)
- Audio enhancement toggle
list_voices: Get voice catalog with descriptions
batch_generate: Process multiple texts efficiently
process_script: Complete video script conversion
- Automatic paragraph detection
- Configurable gap duration
- Handles long texts via automatic chunking

Troubleshooting MCP

"Could not attach to MCP server"

Use absolute path: /full/path/to/venv/bin/python
Test server runs: python -m aparsoft_tts.mcp_server
Check Python version: python --version (needs 3.10+)

"Tool not found"

# Reinstall MCP dependencies
pip install -e ".[mcp]"

# Verify FastMCP
python -c "from fastmcp import FastMCP; print('✅ OK')"

Detailed Documentation: See TUTORIAL.md for comprehensive MCP guide with advanced features, debugging, and production deployment.

Docker Deployment

Build and Run

# Build image
docker build -t aparsoft-tts:latest .

# Run MCP server
docker run -d \
  --name aparsoft-tts \
  -v $(pwd)/outputs:/app/outputs \
  -v $(pwd)/logs:/app/logs \
  aparsoft-tts:latest

# Run CLI commands
docker run --rm \
  -v $(pwd)/outputs:/app/outputs \
  aparsoft-tts:latest \
  aparsoft-tts generate "Docker test" -o /app/outputs/test.wav

Docker Compose

# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Environment Variables

# TTS Configuration
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true

# MCP Server
MCP_SERVER_NAME=aparsoft-tts-server
MCP_ENABLE_RATE_LIMITING=true

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json

Project Structure

youtube-creator/
├── aparsoft_tts/
│   ├── core/
│   │   └── engine.py          # TTS engine
│   ├── utils/
│   │   ├── audio.py           # Audio processing with librosa
│   │   ├── logging.py         # Structured logging
│   │   └── exceptions.py      # Custom exceptions
│   ├── config.py              # Configuration management
│   ├── cli.py                 # CLI interface
│   └── mcp_server.py          # MCP server (FastMCP)
├── tests/
│   ├── unit/                  # Unit tests
│   └── integration/           # Integration tests
├── examples/                  # Usage examples
├── pyproject.toml             # Project metadata
├── Dockerfile                 # Docker configuration
└── docker-compose.yml         # Docker Compose config

Audio Processing

The toolkit enhances Kokoro's output with professional audio processing:

Features:

Normalization: Consistent volume levels
Silence Trimming: Remove quiet sections (configurable threshold)
Noise Reduction: Spectral gating for cleaner audio
Fade In/Out: Smooth transitions, prevents clicks
Custom Processing: Extensible with librosa/scipy

Enhancement Pipeline:

from aparsoft_tts.utils.audio import enhance_audio, save_audio

# Generate raw audio
audio = engine.generate("Your text here")

# Apply enhancement pipeline
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,      # Normalize volume
    trim_silence=True,   # Trim silence
    trim_db=20.0,        # Threshold in dB
    noise_reduction=True,  # Apply noise gate
    add_fade=True,       # Add fade in/out
    fade_duration=0.1    # 100ms fade
)

# Save enhanced audio
save_audio(enhanced, "enhanced.wav", sample_rate=24000)

Configuration

Using Configuration Files

from aparsoft_tts import TTSConfig, MCPConfig, LoggingConfig, Config

# TTS settings
tts_config = TTSConfig(
    voice="am_michael",
    speed=1.0,
    enhance_audio=True,
    sample_rate=24000,
    output_format="wav"
)

# MCP server settings
mcp_config = MCPConfig(
    server_name="aparsoft-tts-production",
    enable_rate_limiting=True,
    rate_limit_calls=100
)

# Logging settings
logging_config = LoggingConfig(
    level="INFO",
    format="json",
    output="file"
)

# Combined configuration
config = Config(
    tts=tts_config,
    mcp=mcp_config,
    logging=logging_config
)

Environment Variables

Create .env file:

# TTS Settings
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
TTS_SAMPLE_RATE=24000

# Audio Processing
TTS_TRIM_SILENCE=true
TTS_TRIM_DB=20.0
TTS_FADE_DURATION=0.1

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=console
LOG_OUTPUT=stdout

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=aparsoft_tts --cov-report=html

# Run specific test file
pytest tests/unit/test_engine.py

# Run only fast tests
pytest -m "not slow"

Development

Setup Development Environment

# Clone repository
git clone https://github.com/aparsoft/kokoro-youtube-tts.git
cd kokoro-youtube-tts

# Install with dev dependencies
pip install -e ".[dev,mcp,cli,all]"

# Install pre-commit hooks
pre-commit install

Running CI Locally

The project includes GitHub Actions workflow for CI/CD:

Code quality checks (Black, Ruff, mypy)
Tests on multiple Python versions (3.10, 3.11, 3.12)
Docker build verification
Security scanning with Trivy

API Reference

TTSEngine

Initialization:

TTSEngine(config: TTSConfig | None = None)

Methods:

generate(text, output_path, voice, speed, enhance) - Generate speech
generate_stream(text, voice, speed) - Stream audio chunks
batch_generate(texts, output_dir, voice, speed) - Batch processing
process_script(script_path, output_path, gap_duration, voice, speed) - Process scripts
list_voices() - Get available voices

Configuration Classes

TTSConfig - TTS engine settings
MCPConfig - MCP server configuration
LoggingConfig - Logging configuration
Config - Main application configuration

Audio Utilities

enhance_audio(audio, ...) - Apply audio enhancement
combine_audio_segments(segments, ...) - Combine audio files
save_audio(audio, path, ...) - Save audio to file
load_audio(path, ...) - Load audio from file
chunk_audio(audio, ...) - Split audio into chunks
get_audio_duration(audio, ...) - Get audio duration

Examples

See the examples/ directory for complete examples:

basic_usage.py - Simple generation examples
youtube_workflow.py - Complete YouTube video production workflow

Troubleshooting

espeak-ng not found

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak

# Windows: Download from http://espeak.sourceforge.net/

Audio quality issues

Enable audio enhancement:

engine.generate(text="Your text", enhance=True)

Import errors

Ensure virtual environment is activated:

source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

Docker issues

Check container logs:

docker logs aparsoft-tts

Performance

Benchmarks (on typical consumer hardware):

Model Loading: ~2-3 seconds (one-time)
Generation Speed: ~0.5s per second of audio
Memory Usage: ~2GB RAM (model loaded)
Token Processing: Up to 510 tokens per pass

Text Length Limits:

Kokoro-82M processes up to 510 tokens in a single pass. For longer texts:

Automatic chunking: Engine automatically splits long texts
Script processing: Handles unlimited length via intelligent segmentation
Batch processing: Each segment processed independently

Optimization Tips:

Reuse engine instances (avoid reloading model)
Disable enhancement for draft generations (enhance=False)
Use streaming for long texts (automatic chunking)
Batch process multiple files for efficiency
Enable GPU acceleration on supported platforms
For very long texts, use process_script() for optimal chunking

Credits & Acknowledgements

This project builds upon excellent open-source software:

Core Dependencies

Kokoro-82M by hexgrad - Apache License 2.0
- Open-weight TTS model with 82M parameters
- Processes up to 510 tokens per pass
- Architectured by @yl4579 (StyleTTS 2)
- 24kHz audio output, <100 hours training data
librosa - ISC License
- Audio analysis and processing
FastMCP - MIT License
- Model Context Protocol server framework

Additional Dependencies

soundfile - Audio I/O
pydantic - Configuration management
structlog - Structured logging
typer - CLI framework
pytest - Testing framework

Special Thanks

🛠️ @yl4579 for StyleTTS 2 architecture
🏆 hexgrad team for Kokoro model and inference library
🌐 Anthropic for Model Context Protocol
📊 All contributors to the open-source dependencies

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Third-Party Licenses:

Kokoro-82M: Apache License 2.0
librosa: ISC License
FastMCP: MIT License

Support

Email: contact@aparsoft.com
Website: https://aparsoft.com
Issues: GitHub Issues

Citation

If you use this toolkit in your research or project, please cite:

@software{kokoro_youtube_tts,
  author = {Aparsoft},
  title = {Kokoro YouTube TTS: Comprehensive TTS Toolkit},
  year = {2025},
  url = {https://github.com/aparsoft/kokoro-youtube-tts}
}

For the Kokoro model:

@software{kokoro_tts,
  author = {hexgrad},
  title = {Kokoro-82M: Open-weight TTS Model},
  year = {2024},
  url = {https://huggingface.co/hexgrad/Kokoro-82M}
}

Built with ❤️ for the video creator community

Recommended Servers

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official

Featured

TypeScript

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official

Featured

Local

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official

Featured

TypeScript

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official

Featured

Python

E2B

Using MCP to run code via e2b.

Official

Featured

Neon Database

MCP server for interacting with Neon Management API and databases

Official

Featured

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official

Featured

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official

Featured