mcp-server-documentary-generation
Targets autonomous generation of long-form YouTube documentaries (15-25 minutes) with minimal human intervention, focusing on historical niches such as Byzantine history. The current MVP produces a test documentary of about 2 minutes.
Status: MVP / Proof of Concept. The end-to-end pipeline is working. See Next Steps for planned upgrades from local inference to production APIs.
An autonomous documentary generation system orchestrated by Claude Code via MCP (Model Context Protocol). Given a topic, it produces a narrated video with AI-generated visuals — with minimal human intervention.
Built as a portfolio project demonstrating agentic AI orchestration, local ML inference, and multi-modal content pipelines.
What It Does
- Fetches and summarises Wikipedia research on a topic (Greek-first, English fallback)
- Parses a structured script into timestamped scenes
- Generates image prompts per scene (Byzantine manuscript style)
- Renders images locally with Stable Diffusion 1.5
- Synthesises Greek narration with Chatterbox Multilingual TTS
- Assembles everything into a video with Ken Burns effect via FFmpeg
Test output: a ~2-minute Greek-language documentary on the Fall of Constantinople (1453).
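The "Greek-first, English fallback" research step can be sketched roughly as follows. This is a minimal illustration, not the project's `tools/research.py`: it assumes the public Wikimedia REST page-summary endpoint, and the names `http_summary` and `fetch_summary` are hypothetical. The fetcher is injectable so the fallback logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumption: the Wikimedia REST "page summary" endpoint. The real tool may
# use a different Wikipedia API or a client library.
SUMMARY_URL = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"


def http_summary(lang: str, title: str) -> Optional[str]:
    """Return the plain-text summary of a page, or None if unavailable."""
    path_title = urllib.parse.quote(title.replace(" ", "_"), safe="")
    url = SUMMARY_URL.format(lang=lang, title=path_title)
    request = urllib.request.Request(
        url, headers={"User-Agent": "documentary-sketch/0.1 (example)"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response).get("extract") or None
    except (OSError, ValueError):
        # Network errors, HTTP 404 and malformed JSON all count as "not found".
        return None


def fetch_summary(
    title: str,
    languages: tuple[str, ...] = ("el", "en"),
    fetch: Callable[[str, str], Optional[str]] = http_summary,
) -> Optional[tuple[str, str]]:
    """Try each language edition in order; return (lang, text) for the first hit."""
    for lang in languages:
        text = fetch(lang, title)
        if text:
            return lang, text
    return None
```

Note that a Greek title will rarely match an English page name, so a real fallback would probably also need to map the title between language editions; that step is left out here.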
Architecture
flowchart TD
A([Topic]) --> B[research\nWikipedia API]
B --> C[Script\nClaude inline]
C --> D[build_storyboard\nparse scenes]
D --> E[Claude fills\nimage prompts]
E --> F[image_gen\nSD 1.5 CPU]
E --> G[tts_batch\nChatterbox TTS]
F --> H[assemble\nFFmpeg]
G --> H
H --> I([video.mp4])
style A fill:#1a1a2e,color:#eee,stroke:#555
style I fill:#1a1a2e,color:#eee,stroke:#555
MCP Tool Layer
Claude Code acts as the orchestrator. Each stage is exposed as an MCP tool that Claude can call autonomously:
| Tool | Description |
|---|---|
| research | Fetch Wikipedia outline, save to research/<topic>/outline.txt |
| build_storyboard | Parse script into scenes, initialise project folder |
| save_storyboard | Persist scenes with image prompts filled by Claude |
| tts_batch | Synthesise narration WAV per scene (checkpointed) |
| image_gen | Generate image PNG per scene (checkpointed) |
| assemble | Combine audio + image → MP4 with Ken Burns effect |
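To make the `build_storyboard` step more concrete, here is one way a script could be split into timestamped scenes. The README does not document the script format, so everything below is an assumption: scene labels on their own line in square brackets (for example `[HOOK]`), timestamps estimated from word count at an assumed 130 words per minute, and the names `Scene` and `parse_script`.

```python
import re
from dataclasses import dataclass

# Assumption: a scene starts with a bracketed label on its own line.
HEADER = re.compile(r"^\[(?P<label>[^\]]+)\]$")
WORDS_PER_MINUTE = 130  # assumed narration speed, not taken from the project


@dataclass
class Scene:
    index: int
    label: str
    text: str
    start_s: float
    end_s: float


def parse_script(script: str, wpm: int = WORDS_PER_MINUTE) -> list[Scene]:
    """Split a labelled script into scenes with estimated start/end times."""
    sections: list[tuple[str, list[str]]] = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            sections.append((match.group("label"), []))
        elif sections and line:
            # Text before the first header is ignored.
            sections[-1][1].append(line)

    scenes: list[Scene] = []
    clock = 0.0
    for index, (label, lines) in enumerate(sections, start=1):
        text = " ".join(lines)
        duration = len(text.split()) * 60 / wpm
        scenes.append(Scene(index, label, text, round(clock, 2), round(clock + duration, 2)))
        clock += duration
    return scenes
```

The estimated times are only placeholders; once narration audio exists, the actual WAV durations would be the more reliable source for scene timing.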
Stack
| Component | Technology | Notes |
|---|---|---|
| Orchestration | Claude Code (MCP) | No extra API calls — Claude itself fills prompts |
| Research | Wikipedia API | Greek Wikipedia first, English fallback |
| Image generation | SD 1.5 (runwayml/stable-diffusion-v1-5) | Local CPU, ~4 min/image |
| TTS | Chatterbox Multilingual (ResembleAI) | Greek (el), MIT licence, local CPU |
| Audio stretch | librosa time_stretch | Slows narration to 0.85× for documentary pacing |
| Video assembly | FFmpeg | Ken Burns (zoompan), AAC audio, H.264 |
| Visual style | Byzantine manuscript prompts | Pencil sketch, aged parchment, charcoal, 16:9 |
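As an illustration of the video assembly row, the sketch below builds an FFmpeg argument list that applies a slow zoom (the `zoompan` filter) to one still image for the length of its narration, encoding H.264 video and AAC audio. It is not the project's `assemble.py`: the function name, the 1280x720 frame size, the 25 fps rate and the 1.2 maximum zoom are all assumptions chosen for the example.

```python
import math


def ken_burns_command(
    image: str,
    audio: str,
    output: str,
    audio_seconds: float,
    fps: int = 25,           # assumed frame rate
    size: str = "1280x720",  # assumed 16:9 output size
    max_zoom: float = 1.2,   # assumed final zoom factor
) -> list[str]:
    """Build an ffmpeg command that zooms slowly into a still image."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    frames = math.ceil(audio_seconds * fps)
    step = (max_zoom - 1.0) / frames  # zoom increment per output frame
    zoompan = (
        f"zoompan=z='min(zoom+{step:.6f},{max_zoom})'"
        ":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'"  # keep the zoom centred
        f":d={frames}:s={size}:fps={fps}"
    )
    return [
        "ffmpeg", "-y",
        "-i", image,
        "-i", audio,
        "-filter_complex", f"[0:v]{zoompan}[v]",
        "-map", "[v]", "-map", "1:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        output,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. If the narration is slowed to 0.85× first, `audio_seconds` should be the stretched length: a 10 s clip becomes roughly 10 / 0.85 ≈ 11.8 s.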
Project Structure
mcp-server-documentary-generation/
├── server.py              # MCP server: registers all tools
├── tools/
│   ├── project.py         # Folder layout helpers (slugify, scene_paths)
│   ├── research.py        # Wikipedia fetch → outline.txt
│   ├── storyboard.py      # Script parser → scenes.json
│   ├── tts_batch.py       # Chatterbox batch TTS
│   ├── image_gen.py       # SD 1.5 batch image generation
│   └── assemble.py        # FFmpeg video assembly
├── script/
│   └── aloси_1453.txt     # Test script (Greek, 5 scenes)
├── storyboard/
│   └── scenes.json        # Committed scene index with prompts
└── generated/             # Gitignored: all media output lives here
    └── <title>/
        ├── scenes.json
        ├── video.mp4
        └── scene_XX/
            ├── script.txt
            ├── image.png
            └── audio.wav
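The folder layout above suggests what the `slugify` and `scene_paths` helpers in `tools/project.py` might do. The version below is a guess at their behaviour, not the project's code: it assumes the slug keeps Unicode letters and replaces other runs of characters with underscores, and that `scene_XX` means a zero-padded two-digit index.

```python
import re
import unicodedata
from pathlib import Path

GENERATED_ROOT = Path("generated")  # matches the gitignored output folder


def slugify(title: str) -> str:
    """Turn a title into a folder name, keeping non-ASCII letters."""
    normalised = unicodedata.normalize("NFC", title.strip())
    # \W is Unicode-aware for str patterns, so Greek letters survive.
    return re.sub(r"\W+", "_", normalised).strip("_")


def scene_paths(title: str, index: int, root: Path = GENERATED_ROOT) -> dict[str, Path]:
    """Return the per-scene file paths for a 1-based scene index."""
    scene_dir = root / slugify(title) / f"scene_{index:02d}"  # assumed padding
    return {
        "dir": scene_dir,
        "script": scene_dir / "script.txt",
        "image": scene_dir / "image.png",
        "audio": scene_dir / "audio.wav",
    }
```

Under these assumptions the title "Άλωση 1453" maps to the folder `generated/Άλωση_1453`, which is consistent with the paths used in the manual commands.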
Output Example
- Topic: Η Άλωση της Κωνσταντινούπολης (1453), "The Fall of Constantinople (1453)"
- Language: Greek
- Scenes: 5 (HOOK → ΠΕΡΙΒΑΛΛΟΝ → ΠΤΩΣΗ → ΤΕΛΟΣ → ΕΠΙΛΟΓΟΣ; in English: hook, setting, fall, end, epilogue)
- Runtime: ~2 minutes
- Style: Byzantine manuscript illustration, pencil sketch on aged parchment
Running It
Prerequisites
pip install -r requirements.txt
winget install ffmpeg # Windows
Run a stage manually
# Research
py -m tools.research "Άλωση της Κωνσταντινούπολης"
# Parse script into scenes
py -m tools.storyboard script/aloси_1453.txt --title "Άλωση 1453"
# Generate TTS for all scenes
py -m tools.tts_batch generated/Άλωση_1453/scenes.json --title "Άλωση 1453"
# Generate images (20 diffusion steps)
py -m tools.image_gen generated/Άλωση_1453/scenes.json --title "Άλωση 1453" --steps 20
# Assemble final video
py -m tools.assemble generated/Άλωση_1453/scenes.json --title "Άλωση 1453"
Run via MCP (Claude Code)
Add to your Claude Code MCP config:
{
  "mcpServers": {
    "documentary": {
      "command": "py",
      "args": ["-m", "server"],
      "cwd": "/path/to/mcp-server-documentary-generation"
    }
  }
}
Claude Code can then call research, build_storyboard, save_storyboard, tts_batch, image_gen, and assemble directly as tools.
Checkpointing
Every tool skips files that already exist. You can interrupt and resume at any stage without re-running completed work:
- tts_batch → skips scenes with existing audio.wav
- image_gen → skips scenes with existing image.png
- assemble → skips scenes missing either asset
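The skip-if-exists behaviour can be sketched in a few lines. This is an illustration of the idea rather than the project's implementation: `run_checkpointed` and the `produce` callback are hypothetical names, and writing to a temporary `.part` file before renaming is an added assumption so that an interrupted run does not leave a half-written file that later counts as complete.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_checkpointed(
    scene_dirs: Iterable[Path],
    filename: str,
    produce: Callable[[Path], None],
) -> tuple[list[Path], list[Path]]:
    """Create `filename` in each scene folder unless it already exists.

    `produce` receives the path it should write to. Returns (created, skipped).
    """
    created: list[Path] = []
    skipped: list[Path] = []
    for scene_dir in scene_dirs:
        target = Path(scene_dir) / filename
        if target.exists():
            skipped.append(target)  # checkpoint hit: nothing to do
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        partial = target.with_name(target.name + ".part")
        produce(partial)          # expensive step, e.g. TTS or image generation
        partial.replace(target)   # rename only after the write finished
        created.append(target)
    return created, skipped
```

A TTS stage would call this with `filename="audio.wav"` and an image stage with `filename="image.png"`; rerunning after an interruption then only processes the scenes whose output is still missing.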
Next Steps
This MVP validates the end-to-end pipeline. Production upgrades planned:
Quality
- [ ] Images: Swap SD 1.5 CPU → FLUX Dev via Replicate API (10× better quality, seconds not minutes)
- [ ] TTS: Swap Chatterbox CPU → ElevenLabs or Azure Neural TTS (more natural, faster)
- [ ] Research: Add ChromaDB RAG for multi-source grounding beyond Wikipedia
Features
- [ ] Subtitles: WhisperX forced alignment → .srt burn-in
- [ ] Music: Overlay public domain tracks (Musopen)
- [ ] Upload: YouTube Data API v3 — auto title, description, chapters, thumbnail
- [ ] Thumbnail: Auto-generate from scene 1 image + title overlay
Scale
- [ ] Parameterise topic, language, and style via Claude conversation
- [ ] Support 15–25 min documentaries (currently ~2 min test)
- [ ] Fine-tune image prompts per historical period (Byzantine, Ottoman, Classical Greek)