youtube-mcp

Enables AI assistants to watch YouTube videos by extracting frames at scene changes and visual references, pairing each frame with the exact words spoken at that timestamp. Provides dense frame-transcript interleaving for any model.

<p align="center"> <img src="https://img.shields.io/badge/MCP-YouTube-red?style=for-the-badge&logo=youtube&logoColor=white" alt="YouTube MCP" /> <img src="https://img.shields.io/badge/Gemini--Style-Video_Understanding-blue?style=for-the-badge" alt="Gemini-style" /> <img src="https://img.shields.io/badge/Any_Model-Universal-green?style=for-the-badge" alt="Any Model" /> </p>

<h1 align="center">youtube-mcp</h1>

<p align="center"> <strong>Give any AI the ability to watch YouTube videos.</strong><br/> Dense frame-transcript interleaving. Scene detection. Visual cue analysis.<br/> Gemini-style video understanding — for any model. </p>

<p align="center"> <a href="#-quick-install">Quick Install</a> • <a href="#-how-it-works">How It Works</a> • <a href="#-tools">Tools</a> • <a href="#-ai-installer">AI Installer</a> • <a href="#-examples">Examples</a> </p>


What is this?

An MCP server that lets AI assistants actually watch YouTube videos — not just read transcripts.

It extracts frames at scene changes and visual reference moments, pairs each frame with the exact words spoken at that timestamp, and returns everything as dense interleaved content. The AI sees what's on screen at the exact moment someone says "as you can see here."

No existing YouTube MCP server does this. Every other one is transcript-only. This is the first to combine transcript + vision.

The Token Math

| Approach | 10 min video | Token cost |
|---|---|---|
| Gemini native (1 FPS, 258 tok/frame) | 600 frames | ~155K tokens |
| Sending raw JPEGs to any model | 600 frames | ~7.2M tokens |
| youtube-mcp (dense interleave, 1 frame/5s) | 120 frames | ~1.4M tokens |
| youtube-mcp describe mode (BLIP-2 → text) | 600 frames | ~18K tokens |

Describe mode gives you 400x savings over raw images — full visual coverage as pure text.
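
The arithmetic behind these figures can be reproduced with a short sketch. The ~12K tokens per raw frame is the figure quoted in the Research section, and ~30 tokens per caption is only inferred from 18K / 600; neither is measured here, and `estimateTokens` is a hypothetical helper, not part of the server.

```typescript
// Rough token estimates for a 10-minute (600 s) video.
// Per-frame costs are the figures quoted in this README, not measurements.
const DURATION_S = 600;

interface Approach {
  name: string;
  secondsPerFrame: number; // sampling interval
  tokensPerFrame: number;  // assumed per-frame cost
}

const approaches: Approach[] = [
  { name: "Gemini native", secondsPerFrame: 1, tokensPerFrame: 258 },
  { name: "Raw JPEGs", secondsPerFrame: 1, tokensPerFrame: 12000 },
  { name: "youtube-mcp interleave", secondsPerFrame: 5, tokensPerFrame: 12000 },
  // ~30 tokens per caption is inferred from 18K / 600 (assumption)
  { name: "youtube-mcp describe", secondsPerFrame: 1, tokensPerFrame: 30 },
];

function estimateTokens(a: Approach, durationS: number = DURATION_S): number {
  const frames = Math.floor(durationS / a.secondsPerFrame);
  return frames * a.tokensPerFrame;
}

for (const a of approaches) {
  console.log(`${a.name}: ~${estimateTokens(a).toLocaleString("en-US")} tokens`);
}
```

The 400x figure is the ratio of the raw-JPEG estimate (7.2M) to the describe-mode estimate (18K).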


Quick Install

Prerequisites

brew install yt-dlp ffmpeg

Setup

git clone https://github.com/coah80/youtube-mcp.git
cd youtube-mcp
bun install

Add to Claude Code

Add to your ~/.mcp.json:

{
  "mcpServers": {
    "youtube": {
      "command": "bun",
      "args": ["run", "/path/to/youtube-mcp/src/index.ts"]
    }
  }
}

Then restart Claude Code.

Optional: Enable describe mode (local BLIP-2)

This lets the AI "see" every second of a video as text descriptions — no images needed.

cd youtube-mcp
uv venv --python 3.11 .venv
uv pip install --python .venv/bin/python torch torchvision transformers accelerate pillow

First run downloads the BLIP-2 model (~3GB). Subsequent runs use the cached model.


AI Installer

For AI assistants: Copy and paste this into your MCP config to install youtube-mcp.

Add this to ~/.mcp.json (create the file if it doesn't exist):

{
  "mcpServers": {
    "youtube": {
      "command": "bun",
      "args": ["run", "/path/to/youtube-mcp/src/index.ts"]
    }
  }
}

Replace /path/to/youtube-mcp with the actual clone path.

If the file already has other servers, just add the "youtube" key inside the existing "mcpServers" object.
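
For assistants that edit the config programmatically, the merge could look roughly like the sketch below. `addYoutubeServer` and the `McpConfig` type are hypothetical names introduced for illustration; the function only builds the merged object and leaves reading and writing ~/.mcp.json to the caller.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
}

interface McpConfig {
  mcpServers?: Record<string, McpServerEntry>;
  [key: string]: unknown; // preserve any other top-level keys
}

// Returns a new config with the "youtube" entry added.
// Existing servers are kept; the input object is not mutated.
function addYoutubeServer(existing: McpConfig, clonePath: string): McpConfig {
  return {
    ...existing,
    mcpServers: {
      ...(existing.mcpServers ?? {}),
      youtube: {
        command: "bun",
        args: ["run", `${clonePath}/src/index.ts`],
      },
    },
  };
}

// Example: a config that already has one (made-up) server.
const current: McpConfig = {
  mcpServers: { other: { command: "node", args: ["server.js"] } },
};
const merged = addYoutubeServer(current, "/path/to/youtube-mcp");
console.log(JSON.stringify(merged, null, 2));
```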


How It Works

YouTube URL
    │
    ├──→ yt-dlp ──→ Transcript (timestamped, word-level)
    │
    ├──→ yt-dlp ──→ Stream URL ──→ ffmpeg ──→ Frames
    │                                │
    │                    ┌────────────┼────────────┐
    │                    │            │            │
    │              Scene Change  Visual Cues  Regular
    │              Detection    in Transcript  Intervals
    │              (ffmpeg)     ("as you can   (fill gaps)
    │                           see here")
    │                    │            │            │
    │                    └────────────┼────────────┘
    │                                │
    │                    Frame Selection (prioritized)
    │                                │
    └──────────────────→ Dense Interleave
                              │
                   ┌──────────┴──────────┐
                   │                     │
              Image Mode            Describe Mode
           (raw screenshots)     (BLIP-2 captions)
                   │                     │
              Frame + "words         Text description
              spoken during          + "words spoken
              this frame"            during this frame"

Visual Cue Detection

The analyzer scans transcript text for 25+ patterns indicating the speaker is referencing something visual:

| Pattern | Example |
|---|---|
| as you can see | "As you can see here, the API returns..." |
| look at this | "Look at this graph" |
| on screen | "What's on screen right now is..." |
| click here | "If you click here, it opens..." |
| this diagram | "In this diagram, we have..." |
| notice how | "Notice how the color changes" |

When detected, a frame is extracted at that exact timestamp — so the AI sees what the speaker was pointing at.
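
A minimal sketch of this kind of matching, using only the six patterns listed above. The `TranscriptLine` shape and the `findVisualCues` name are assumptions for illustration; the actual analyzer may structure its patterns differently.

```typescript
interface TranscriptLine {
  start: number; // seconds from the start of the video
  text: string;
}

interface VisualCue {
  timestamp: number;
  pattern: string;
  text: string;
}

// Subset of cue phrases taken from the table above.
const CUE_PHRASES: string[] = [
  "as you can see",
  "look at this",
  "on screen",
  "click here",
  "this diagram",
  "notice how",
];

// Case-insensitive, word-bounded regexes; no "g" flag, so test() is stateless.
const CUE_PATTERNS = CUE_PHRASES.map((phrase) => ({
  phrase,
  regex: new RegExp(`\\b${phrase}\\b`, "i"),
}));

// Returns one cue per transcript line that contains a visual reference.
function findVisualCues(lines: TranscriptLine[]): VisualCue[] {
  const cues: VisualCue[] = [];
  for (const line of lines) {
    const hit = CUE_PATTERNS.find((p) => p.regex.test(line.text));
    if (hit) {
      cues.push({ timestamp: line.start, pattern: hit.phrase, text: line.text });
    }
  }
  return cues;
}
```

Each returned `timestamp` is a candidate moment for frame extraction.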

Scene Change Detection

Uses ffmpeg's select filter with a scene-change score threshold of 0.3 (select=gt(scene,0.3)) to find where the visual content actually changes. This means:

  • Static talking-head sections get fewer frames (nothing's changing)
  • Slide transitions, screen recordings, demos get more frames (lots changing)
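
One way to get scene-change timestamps out of ffmpeg is to chain the select filter with showinfo and read `pts_time` values from ffmpeg's log output. The sketch below shows that approach; pairing select with showinfo is an assumption (the source only names the select expression), and `buildSceneArgs` and `parseSceneTimes` are hypothetical helpers.

```typescript
// Builds ffmpeg arguments that log every frame whose scene score exceeds
// the threshold. Output is discarded ("-f null -"); only the log matters.
function buildSceneArgs(input: string, threshold: number = 0.3): string[] {
  return [
    "-hide_banner",
    "-i", input,
    "-an", // ignore audio
    "-vf", `select='gt(scene,${threshold})',showinfo`,
    "-f", "null",
    "-",
  ];
}

// Extracts pts_time values (seconds) from showinfo log lines.
function parseSceneTimes(log: string): number[] {
  const times: number[] = [];
  for (const line of log.split("\n")) {
    const match = line.match(/pts_time:\s*([0-9]+(?:\.[0-9]+)?)/);
    if (match) {
      times.push(parseFloat(match[1]));
    }
  }
  return times;
}
```

The arguments would be passed to an ffmpeg process (for example via a process-spawning API), and the captured stderr handed to `parseSceneTimes`.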

Segment-Based Processing

For videos longer than 5 minutes, watch_video processes in 3-minute segments with ~1 frame every 5 seconds. The AI calls it repeatedly:

watch_video(url) → first 3 min, 36 frames
watch_video(url, start_time=180) → next 3 min, 36 frames
watch_video(url, start_time=360) → next 3 min, 36 frames
...until the end

Each response tells the AI how to continue: "To continue watching, call watch_video with start_time=360"
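
The segment arithmetic can be sketched as follows, assuming fixed 180-second segments and one frame every 5 seconds. `planSegment` is a hypothetical helper; it ignores the scene-change and visual-cue frames the real tool also selects, and does not model the 5-minute threshold below which a video is processed in one pass.

```typescript
const SEGMENT_S = 180;       // 3-minute segments
const FRAME_INTERVAL_S = 5;  // ~1 frame every 5 seconds

interface SegmentPlan {
  startTime: number;
  endTime: number;
  frameTimes: number[];          // regular-interval frame timestamps
  nextStartTime: number | null;  // null once the video is finished
}

function planSegment(durationS: number, startTime: number = 0): SegmentPlan {
  const endTime = Math.min(startTime + SEGMENT_S, durationS);
  const frameTimes: number[] = [];
  for (let t = startTime; t < endTime; t += FRAME_INTERVAL_S) {
    frameTimes.push(t);
  }
  return {
    startTime,
    endTime,
    frameTimes,
    nextStartTime: endTime < durationS ? endTime : null,
  };
}

// Walk a 10-minute video segment by segment.
let next: number | null = 0;
while (next !== null) {
  const plan: SegmentPlan = planSegment(600, next);
  console.log(`start=${plan.startTime} frames=${plan.frameTimes.length}`);
  next = plan.nextStartTime;
}
```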


Tools

| Tool | What it does |
|---|---|
| watch_video | Dense frame↔transcript interleaving in segments. ~1 frame/5s. The full "watch" experience. |
| describe_video | Full visual coverage via local BLIP-2. Every frame described as text. 400x fewer tokens than images. |
| get_scene_overview | Composite grid image of scene changes. Quick visual summary of the whole video. |
| get_frames | Extract frames at specific timestamps. For drilling into moments. |
| get_transcript | Full timestamped transcript. |
| get_video_info | Video metadata (title, channel, duration, views, description). |

Examples

"Watch this video and summarize it"

The AI calls watch_video and gets interleaved content like:

[1:23] (scene change) "and here's where it gets interesting"
[screenshot of code editor]

[1:28] "if you look at this function right here"
[screenshot showing the function being discussed]

[1:33] (visual reference) "notice how the state updates"
[screenshot at the exact moment they reference the visual]
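
Output in this shape could be assembled roughly as in the sketch below: each frame is paired with the words spoken from its timestamp until the next frame. The `Word` and `Frame` types and the `interleave` function are assumptions for illustration, not the server's actual code, and the screenshots themselves are omitted.

```typescript
interface Word {
  start: number; // seconds
  text: string;
}

interface Frame {
  timestamp: number; // seconds
  reason: "scene change" | "visual reference" | "interval";
}

// Formats seconds as M:SS, e.g. 83 -> "1:23".
function formatTimestamp(seconds: number): string {
  const m = Math.floor(seconds / 60);
  const s = Math.floor(seconds % 60);
  return `${m}:${s.toString().padStart(2, "0")}`;
}

// Pairs each frame with the words spoken until the next frame begins.
function interleave(frames: Frame[], words: Word[]): string[] {
  const sorted = [...frames].sort((a, b) => a.timestamp - b.timestamp);
  return sorted.map((frame, i) => {
    const end = i + 1 < sorted.length ? sorted[i + 1].timestamp : Infinity;
    const spoken = words
      .filter((w) => w.start >= frame.timestamp && w.start < end)
      .map((w) => w.text)
      .join(" ");
    // Regular-interval frames carry no tag; the others name their trigger.
    const tag = frame.reason === "interval" ? "" : ` (${frame.reason})`;
    return `[${formatTimestamp(frame.timestamp)}]${tag} "${spoken}"`;
  });
}
```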

"Describe this entire lecture for me"

The AI calls describe_video and gets pure text:

[0:00] [VISUAL] A title slide reading "Introduction to Neural Networks"
[0:00] Welcome everyone to today's lecture on neural networks.
[0:05] [VISUAL] A diagram showing interconnected nodes in layers
[0:05] We'll start with the basic architecture.
[0:10] [VISUAL] The same diagram with arrows highlighted between layers
[0:10] Each connection between nodes has a weight...

600 frames of a 10-minute video → ~18K tokens. Fits in any context window.


Architecture

youtube-mcp/
├── src/
│   ├── index.ts        # MCP server — 6 tool definitions
│   ├── youtube.ts      # yt-dlp + ffmpeg operations (stream URL, frames, scenes)
│   ├── analyzer.ts     # Visual cue detection, chunking, dense interleaving
│   ├── describe.ts     # BLIP-2 integration (TypeScript wrapper)
│   └── captioner.py    # BLIP-2 inference (Python, runs on MPS/CUDA/CPU)
├── .venv/              # Python venv for BLIP-2 (optional)
├── package.json
├── tsconfig.json
└── README.md

Tech Stack


Compatibility

Works with any MCP-compatible AI assistant:

  • Claude Code (CLI, Desktop, Web)
  • Claude Desktop
  • Cursor
  • Any future MCP host

The image-based tools (watch_video, get_frames, get_scene_overview) require a vision-capable model.

The text-based tool (describe_video) works with any model — even text-only ones — because BLIP-2 converts all visuals to text locally.


Roadmap

  • [ ] Gemini Flash proxy mode — use Gemini Flash ($0.10/1M tokens) as a visual encoder for higher-quality frame descriptions than BLIP-2
  • [ ] Frame deduplication — perceptual similarity hashing to skip near-identical frames
  • [ ] Keyframe extraction — use ffmpeg I-frame detection instead of fixed intervals
  • [ ] Whisper integration — local audio transcription when YouTube captions aren't available
  • [ ] Timestamp burning — burn MM:SS into frame pixels (requires ffmpeg with libfreetype)
  • [ ] npm package: npx youtube-mcp one-liner install

Research

This project was informed by deep research into how Gemini, GPT-4o, and open-source tools handle video:

  • Gemini processes video at 1 FPS using SigLIP-SO400M (258 tokens/frame) with native multimodal attention
  • GPT-4o sends base64 JPEG frames via the vision API (~12K tokens/frame)
  • No existing YouTube MCP server combines transcript + frame extraction — this is the first

Key references: LiveCC (CVPR 2025), mcp-deep-video, videostil, llm-video-frames


License

MIT


<p align="center"> Built by <a href="https://github.com/coah80">@coah80</a><br/> <sub>Give AI assistants the ability to watch YouTube. Star if this helped you.</sub> </p>
