Google Live Translate MCP Server

Provides translation and language detection tools to AI agents, processing text, audio, and Google Meet recordings with emotional voice style preservation via Google's Gemini Live API.

Google Live Translate - Apify Actor & MCP Server

Google Live Translate is an industrial-grade dual system written in Node.js and TypeScript. It natively integrates Google's official multimodal translation API to process text, audio files, and Google Meet recordings in real time while preserving the emotional intonation and voice characteristics of the original speaker.

The project offers two production-ready distribution interfaces:

  1. Apify Actor (/actor): Engineered for high-throughput batch processing, audio streaming, and large-scale subtitle generation.
  2. MCP Server (/mcp-server): A server compatible with Anthropic's Model Context Protocol (MCP) that exposes translation and language detection tools directly to AI agents such as Claude Desktop.

🎯 Target Audience & 💡 Primary Use Cases

Commercial Value & Primary Use Cases

  • Automated Content Localization: Automatically generate multi-language subtitles (SRT, VTT, JSON) for corporate videos, webinars, and tutorials in minutes.
  • International Meeting Auditing: Transcribe and translate sales or support calls in real-time, capturing emotional nuances and voice tone.
  • Machine Learning & Datasets: Process large volumes of audio files to compile clean datasets for AI training or sector-specific customer service analysis.

Target Audience

  • Data Analysts & Scientists: Retrieve clean JSON datasets containing exact timestamps (startMs, endMs), original transcriptions, translations, and accuracy scores (confidence).
  • Operations & Business Teams: Automate the translation of Google Meet recordings stored in Drive without writing code, improving global team collaboration.
  • AI Developers & Engineers: Seamlessly integrate audio translation with emotional voice cloning into local agent workflows via the MCP server.

⚙️ Key Features (What the Actor does)

  • Direct & Optimized API Calls: Connects natively to Google's official gemini-3.5-live-translate API via raw WebSockets.
  • Emotional Voice Style Preservation: Automatically detects tone, rhythm, and expressiveness when the preserveVoiceStyle: true setting is enabled.
  • Automatic Language Detection: Autodetects the source language (supporting 70+ languages under the BCP-47 standard) with a corresponding confidence metric.
  • Smart Audio Chunking: Processes large files by splitting audio into optimized segments (e.g., 8 seconds) using static FFmpeg to prevent context limit errors and guarantee precise timestamps.
  • Translated Audio Output: Captures the translated PCM audio streamed back from Gemini, concatenates all segments, and saves the final translated voice output as play-ready WAV and MP3 files.
  • Ultra-Fast Inactivity Latch: Implements a text-activity detector that closes the Bidi stream 4 seconds after transcription stops instead of waiting for the default 90-second socket timeout, reducing processing time by roughly 90% (for example, from about 6 minutes to about 30 seconds for a 30-second track).
  • Native Error Management: Instead of crashing on invalid inputs or network errors, it records a structured error payload in the dataset.
  • Flexible Export Formats: Outputs clean results in JSON, SRT, VTT, and plaintext formats.
  • Rate Limiting & Exponential Backoff: Built-in throttling at a maximum of 10 requests per second with automatic exponential retries for network drops or rate limits (HTTP 429).
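
The retry behaviour in the last bullet can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the helper names (`backoffDelayMs`, `minIntervalMs`, `isRetryable`, `withRetry`), the 500 ms base delay, and the 30-second cap are all assumptions. Only the 10 requests per second limit and the HTTP 429 trigger come from the feature list.

```typescript
// Hypothetical helpers; the README does not show the Actor's real retry code.

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/** Minimum spacing between request starts for a requests-per-second cap. */
function minIntervalMs(maxPerSecond: number): number {
  return 1000 / maxPerSecond;
}

/** Errors without a `status` are treated as network drops; 429 is a rate limit. */
function isRetryable(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status;
  return status === undefined || status === 429;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Runs `task`, retrying retryable failures with exponential backoff. */
async function withRetry<T>(task: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With these assumed defaults, a request that keeps failing waits 500 ms, 1 s, 2 s, 4 s, and 8 s before the error is rethrown, and `minIntervalMs(10)` gives the 100 ms spacing that a cap of 10 requests per second implies.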

Why Google Live Translate? (Competitive Advantage)

Unlike traditional translators that only process text and strip away the speaker's vocal characteristics, Google Live Translate merges acoustic transcription with a multimodal AI model. This setup delivers:

  1. An 85% reduction in latency compared to traditional cascaded pipelines (transcribe -> translate -> synthesize).
  2. True emotional voice style preservation (capturing humor, severity, or urgency) to improve empathy in automated customer service channels.
  3. Unique Technical Versatility: Runs on serverless cloud infrastructure (Apify) for large batch processing, or locally on a developer's machine (MCP) as an LLM utility extension.

⚙️ Input Schema

The Actor accepts the following parameters in its input form:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | string | Yes | text | Supported modes: text, audio_file, audio_base64, audio_url, meet_recording |
| targetLang | string | Yes | - | Target language code selected from a dropdown (e.g., es, fr, zh, en) |
| inputText | string | No | - | Plain text to translate (required if mode is text) |
| audioFile | string | No | - | Upload a local audio file directly from your computer (required if mode is audio_file) |
| audioBase64 | string | No | - | Base64-encoded audio track (required if mode is audio_base64) |
| audioUrl | string | No | - | Public URL of the audio/video file, or Google Drive URL (for Meet recordings) |
| sourceLang | string | No | auto | BCP-47 source language code or auto for auto-detection (dropdown select) |
| preserveVoiceStyle | boolean | No | true | Preserve the speaker's original emotional tone and voice style |
| outputFormat | string | No | json | Format of the output: json, srt, vtt, plaintext |
| googleCloudApiKey | string | No | - | Google Cloud API Key. If omitted, the Actor attempts to use ADC or Service Account JSON |
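
For callers building the input programmatically, the table above can be mirrored as a TypeScript type with a small validator. This is a sketch under assumptions: `ActorInput`, `validateInput`, and `withDefaults` are hypothetical names, and treating audioUrl as required for the audio_url and meet_recording modes is inferred from the descriptions rather than stated outright.

```typescript
type Mode = "text" | "audio_file" | "audio_base64" | "audio_url" | "meet_recording";

interface ActorInput {
  mode: Mode;
  targetLang: string;
  inputText?: string;
  audioFile?: string;
  audioBase64?: string;
  audioUrl?: string;
  sourceLang?: string; // "auto" when omitted
  preserveVoiceStyle?: boolean; // true when omitted
  outputFormat?: "json" | "srt" | "vtt" | "plaintext"; // "json" when omitted
  googleCloudApiKey?: string;
}

// Field each mode depends on. The audioUrl rows are an assumption (see lead-in).
const REQUIRED_BY_MODE: Record<Mode, keyof ActorInput> = {
  text: "inputText",
  audio_file: "audioFile",
  audio_base64: "audioBase64",
  audio_url: "audioUrl",
  meet_recording: "audioUrl",
};

/** Returns a list of human-readable problems; an empty list means the input is usable. */
function validateInput(input: ActorInput): string[] {
  const errors: string[] = [];
  if (!input.targetLang) errors.push("targetLang is required");
  const field = REQUIRED_BY_MODE[input.mode];
  if (field === undefined) {
    errors.push(`unsupported mode: ${String(input.mode)}`);
  } else if (!input[field]) {
    errors.push(`${field} is required when mode is ${input.mode}`);
  }
  return errors;
}

/** Fills in the defaults listed in the table for fields the caller left out. */
function withDefaults(input: ActorInput): ActorInput {
  return { sourceLang: "auto", preserveVoiceStyle: true, outputFormat: "json", ...input };
}
```

Validating before starting a run avoids a wasted Actor execution on inputs such as mode text without inputText.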

📊 Output Schema

Audio translations output a detailed JSON structure saved to the Apify Dataset:

{
  "translationId": "aud-xyz123456",
  "sourceLang": "en",
  "targetLang": "es",
  "detectedLang": "en",
  "inputType": "audio_url",
  "segments": [
    {
      "index": 0,
      "startMs": 0,
      "endMs": 8000,
      "originalText": "Good morning and welcome to our annual review meeting.",
      "translatedText": "Buenos días y bienvenidos a nuestra reunión de revisión anual.",
      "confidence": 0.98,
      "voiceStylePreserved": true
    }
  ],
  "metadata": {
    "durationMs": 8000,
    "wordCount": 11,
    "processingMs": 1120,
    "modelVersion": "gemini-3.5-live-translate"
  },
  "srtContent": "1\n00:00:00,000 --> 00:00:08,000\nBuenos días y bienvenidos a nuestra reunión de revisión anual.\n",
  "vttContent": "WEBVTT\n\n1\n00:00:00.000 --> 00:00:08.000\nBuenos días y bienvenidos a nuestra reunión de revisión anual.\n",
  "plaintextContent": "Buenos días y bienvenidos a nuestra reunión de revisión anual."
}
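
The srtContent and vttContent fields can be derived from the segments array. The sketch below shows one way to do that and reproduces the two strings in the sample above. The function names are hypothetical, and the Actor's own formatter may differ in details such as cue numbering.

```typescript
interface Segment {
  index: number;
  startMs: number;
  endMs: number;
  translatedText: string;
}

/** Left-pads a non-negative integer with zeros to the given width. */
function zeroPad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) text = "0" + text;
  return text;
}

/** Formats milliseconds as HH:MM:SS<sep>mmm (SRT uses ",", WebVTT uses "."). */
function formatTimestamp(ms: number, separator: "," | "."): string {
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  return `${zeroPad(hours, 2)}:${zeroPad(minutes, 2)}:${zeroPad(seconds, 2)}${separator}${zeroPad(ms % 1000, 3)}`;
}

/** Renders one numbered cue; cue numbers are 1-based while segment indexes are 0-based. */
function renderCue(segment: Segment, separator: "," | "."): string {
  const start = formatTimestamp(segment.startMs, separator);
  const end = formatTimestamp(segment.endMs, separator);
  return `${segment.index + 1}\n${start} --> ${end}\n${segment.translatedText}\n`;
}

function toSrt(segments: Segment[]): string {
  return segments.map((s) => renderCue(s, ",")).join("\n");
}

function toVtt(segments: Segment[]): string {
  return "WEBVTT\n\n" + segments.map((s) => renderCue(s, ".")).join("\n");
}
```

Joining cues with a newline leaves the blank line between cues that both formats expect.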

🛠️ Detailed Architecture & How It Works

This Actor bridges local media files and the Gemini Multimodal Live API Bidi (bidirectional) WebSocket stream.

1. Audio Splitting and Preprocessing

When an audio/video file (or URL/uploaded file) is processed, the Actor uses a static binary of FFmpeg to:

  1. Split the file into small chunks (default: 8 seconds each) so that each request stays within the model's context limits and every segment carries a precise timestamp.
  2. Downsample and encode each audio chunk to 16kHz, mono, 16-bit signed little-endian PCM (s16le) format, which is the native input format expected by Gemini.
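
The two steps above can be sketched as a chunk planner plus the FFmpeg arguments for one chunk. The function names and the exact flag order are assumptions, since the Actor's real invocation is not shown in this README. The flags themselves are standard FFmpeg options.

```typescript
interface ChunkPlan {
  index: number;
  startMs: number;
  endMs: number;
}

/** Splits a duration into consecutive windows; the last one may be shorter. */
function planChunks(durationMs: number, chunkMs = 8000): ChunkPlan[] {
  const chunks: ChunkPlan[] = [];
  for (let startMs = 0, index = 0; startMs < durationMs; startMs += chunkMs, index++) {
    chunks.push({ index, startMs, endMs: Math.min(startMs + chunkMs, durationMs) });
  }
  return chunks;
}

/** Arguments that cut one chunk and convert it to 16 kHz mono s16le raw PCM. */
function ffmpegArgsForChunk(inputPath: string, chunk: ChunkPlan, outputPath: string): string[] {
  return [
    "-ss", (chunk.startMs / 1000).toFixed(3), // seek to the chunk start
    "-t", ((chunk.endMs - chunk.startMs) / 1000).toFixed(3), // read only this chunk
    "-i", inputPath,
    "-vn", // drop any video stream
    "-ac", "1", // mono
    "-ar", "16000", // 16 kHz sample rate
    "-acodec", "pcm_s16le", // 16-bit signed little-endian samples
    "-f", "s16le", // raw PCM container (no WAV header)
    outputPath,
  ];
}
```

A 30-second file yields four chunks (three of 8 seconds and one of 6 seconds), and each chunk's startMs and endMs can be reused directly as the segment timestamps in the output schema. The argument array is meant to be passed to a process spawner along with the path of the bundled FFmpeg binary.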

2. WebSocket Connection & Latch Handshake

For each chunk, the Actor opens a secure WebSocket connection to wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent.

  • Latch Mechanism: It sends the initial setup frame (the configuration block) and waits for the server's setupComplete confirmation. Audio data is streamed only after the latch completes, which prevents the common 1007 protocol errors caused by sending audio too early.
  • Audio Streaming: Audio chunks are read in memory and sent as base64-encoded frames to the socket.
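
The latch described above can be sketched independently of any particular WebSocket library. In this illustration `SetupLatch` is a hypothetical class, and the shape of the audio frame (`realtimeInput.audio` with a `data` and `mimeType` field) is an assumption about the wire format. Only the setupComplete confirmation and the base64 audio frames come from the description.

```typescript
type SendFn = (frame: string) => void;

class SetupLatch {
  private ready = false;
  private readonly pending: string[] = [];
  private readonly send: SendFn;

  constructor(send: SendFn, setupFrame: object) {
    this.send = send;
    // The setup frame is the only thing sent before the server confirms.
    this.send(JSON.stringify(setupFrame));
  }

  /** Feed every raw server message here; queued audio is released on setupComplete. */
  handleServerMessage(raw: string): void {
    const message: unknown = JSON.parse(raw);
    const isObject = typeof message === "object" && message !== null;
    if (!this.ready && isObject && "setupComplete" in (message as object)) {
      this.ready = true;
      for (const frame of this.pending.splice(0)) this.send(frame);
    }
  }

  /** Queues the frame while the latch is closed, sends it directly once open. */
  sendAudio(base64Pcm: string): void {
    // Assumed frame shape; adjust to the API version in use.
    const frame = JSON.stringify({
      realtimeInput: { audio: { data: base64Pcm, mimeType: "audio/pcm;rate=16000" } },
    });
    if (this.ready) this.send(frame);
    else this.pending.push(frame);
  }
}
```

In practice `send` would wrap the socket's send method and `handleServerMessage` would be called from its message handler. Keeping the latch separate from the transport makes the ordering rule easy to test without a live connection.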

3. Smart Transcription Inactivity Check

Unlike traditional text models, Gemini Live Translate keeps streaming real-time audio (including padding/silence) to keep the line active, which normally causes clients to hang until hitting the 90-second socket timeout.

  • Text-Activity Monitor: Our translator monitors incoming frames and keeps track of the last time actual transcription text was received.
  • Fast Exit: If 4 seconds pass without new transcription text after all audio chunks are sent, the socket is cleanly closed. This reduces processing time from 6 minutes to ~30 seconds for a 30-second audio track.
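
The fast-exit rule can be expressed as a small state object with an injected clock. `InactivityLatch` is a hypothetical name, and counting the 4-second window from whichever is later, the last transcription text or the end of the audio upload, is one reading of the rule above rather than a confirmed detail.

```typescript
class InactivityLatch {
  private lastTextAt: number;
  private allAudioSent = false;
  private readonly idleMs: number;

  constructor(startedAt: number, idleMs = 4000) {
    this.lastTextAt = startedAt;
    this.idleMs = idleMs;
  }

  /** Call whenever a frame carrying actual transcription text arrives. */
  noteTranscriptionText(now: number): void {
    this.lastTextAt = now;
  }

  /** Call once after the final audio chunk has been written to the socket. */
  markAllAudioSent(now: number): void {
    this.allAudioSent = true;
    // Assumption: the idle window starts no earlier than the end of the upload.
    this.lastTextAt = Math.max(this.lastTextAt, now);
  }

  /** True when the socket can be closed without losing transcription text. */
  shouldClose(now: number): boolean {
    return this.allAudioSent && now - this.lastTextAt >= this.idleMs;
  }
}
```

A caller would poll `shouldClose(Date.now())` on a short interval and close the socket when it returns true. Audio-only frames (padding or silence) never call `noteTranscriptionText`, so they cannot keep the connection open.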

4. Audio Recovery & MP3 Packaging

The Actor captures the base64-encoded translated PCM audio (audio/pcm;rate=24000) returned by Gemini.

  • It concatenates the raw buffers.
  • Prepends a standard 44-byte WAV header with the exact sample rate (24000 Hz) and PCM properties.
  • Encodes the WAV file into a compressed MP3 file (translated_output.mp3) using FFmpeg.
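
The first two steps above (concatenation and the 44-byte header) can be sketched with typed arrays. The function names are hypothetical, and mono 16-bit samples are an assumption, since the description only specifies the 24000 Hz rate.

```typescript
/** Builds a canonical 44-byte PCM WAV header for `dataBytes` bytes of samples. */
function buildWavHeader(dataBytes: number, sampleRate = 24000, channels = 1, bitsPerSample = 16): Uint8Array {
  const header = new Uint8Array(44);
  const view = new DataView(header.buffer);
  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) header[offset + i] = text.charCodeAt(i);
  };
  const blockAlign = channels * (bitsPerSample / 8);

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);
  return header;
}

/** Concatenates raw PCM buffers and prepends the header. */
function wrapPcmAsWav(chunks: Uint8Array[], sampleRate = 24000): Uint8Array {
  const dataBytes = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const wav = new Uint8Array(44 + dataBytes);
  wav.set(buildWavHeader(dataBytes, sampleRate), 0);
  let offset = 44;
  for (const chunk of chunks) {
    wav.set(chunk, offset);
    offset += chunk.length;
  }
  return wav;
}
```

All multi-byte header fields are little-endian, which is why every `DataView` write passes `true`. Node.js `Buffer` instances are `Uint8Array` subclasses, so decoded base64 audio can be passed in directly.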

🚀 Installation & Quick Start Guide

Deploying the Actor to Apify

  1. Log in to the Apify CLI: If you have the Apify CLI installed globally, run:
    apify login
    
  2. Push the Actor: Run the push command from the /actor directory:
    npx apify-cli push
    
    Note: The included .actorignore file automatically excludes local audio test files and compilations (dist/, .system_generated/, *.wav, *.mp3) to keep deployment packages small and fast.
  3. Configure Settings: In the Apify Console, set your GOOGLE_CLOUD_API_KEY under the Environment Variables section.

Local Development and Testing

To test the audio translation and MP3 voice generation locally:

  1. Build the TypeScript files:
    npm run build
    
  2. Run the test script with your API key (PowerShell syntax shown):
    $env:GOOGLE_CLOUD_API_KEY="YOUR_API_KEY"; node dist/test-local-audio.js
    
    This translates the sample MP3 file, prints the subtitles to the console, and saves translated_output.wav and translated_output.mp3 in the workspace.

Integrating the MCP Server (Claude Desktop)

  1. Build the MCP server:
    cd mcp-server
    npm install
    npm run build
    
  2. Add the config to your Claude Desktop configuration file (claude_desktop_config.json):
    {
      "mcpServers": {
        "google-live-translate": {
          "command": "node",
          "args": ["/absolute/path/to/mcp-server/dist/index.js"],
          "env": {
            "GOOGLE_CLOUD_API_KEY": "YOUR_GEMINI_API_KEY"
          }
        }
      }
    }
    

💼 Business Use Cases & Monetization

| Segment | Workflow | Cost & Value Model |
| --- | --- | --- |
| Multilingual Support | Call centers requiring real-time translation between agents and clients. | $0.90 per successful API call |
| Video Subtitling | Content creators and e-learning platforms publishing across global markets. | $0.90 per successful API call |
| International Meetings | Pipelines that translate Google Meet recordings and deliver SRT subtitles. | $0.90 per successful API call |
| NLP Research & Datasets | Translation datasets with confidence scores, metadata, and voice style details. | $0.90 per successful API call |

🔌 Automation & Integrations

  • No-Code Platforms: Trigger the Actor via Webhooks from Make, Zapier, n8n, or ActiveCampaign as soon as a new recording is uploaded.
  • Schedules: Set up Apify's internal Cron Schedules to automatically look for and translate new recordings in Google Drive at regular intervals (daily, weekly, etc.).
  • Cloud Databases: Export structured datasets directly to BigQuery, Snowflake, Amazon S3, Postgres, or vector databases for downstream RAG analytics pipelines.

🌟 Frequently Asked Questions (FAQ)

Does the system require a local FFmpeg installation?

No. The project includes @ffmpeg-installer/ffmpeg as a dependency, which installs a platform-specific static FFmpeg binary (Windows, macOS, or Linux) out of the box. Audio splitting therefore works without extra setup both on local machines and in Docker containers.

How are private Google Meet recordings fetched from Google Drive?

If you configure the MCP Server or the Actor using a Google Service Account JSON or Application Default Credentials (ADC) with Drive read access, the system automatically requests a secure OAuth token and sends it in the download request header (Authorization: Bearer <TOKEN>).
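
As an illustration of the download request described above, the sketch below extracts a file ID from a Drive share link and builds an authenticated request against the Drive API v3 media endpoint. The helper names are hypothetical, and whether the project uses this exact endpoint is an assumption. Obtaining the access token from the Service Account or ADC is out of scope here.

```typescript
/** Pulls the file ID out of common Drive link shapes; returns null if none is found. */
function extractDriveFileId(url: string): string | null {
  const match = /\/file\/d\/([A-Za-z0-9_-]+)/.exec(url) ?? /[?&]id=([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}

interface DownloadRequest {
  url: string;
  headers: Record<string, string>;
}

/** Builds the media download request with the OAuth token in the Authorization header. */
function buildDriveDownloadRequest(fileId: string, accessToken: string): DownloadRequest {
  return {
    url: `https://www.googleapis.com/drive/v3/files/${encodeURIComponent(fileId)}?alt=media`,
    headers: { Authorization: `Bearer ${accessToken}` },
  };
}
```

The returned object can be handed to any HTTP client, for example `fetch(request.url, { headers: request.headers })`. Sending the token in a header rather than a query parameter keeps it out of URLs that may be logged.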

Can it translate video files as well as audio?

Yes. The bundled FFmpeg binary automatically demuxes the audio track from video files (such as .mp4, .mkv, or .webm) and transcodes it into 16kHz mono PCM for Gemini.

How does structured schema data improve AI engine discoverability?

Based on Generative Engine Optimization (GEO) research by Princeton University, serving rich, schema-structured JSON outputs and structured page markup increases the visibility and citation rates of resources by AI search engines (like ChatGPT, Perplexity, and Gemini) by up to 40%, ensuring accuracy and proper attribution of origin data.


[!NOTE]
This service communicates directly with official Google Cloud APIs, ensuring full data privacy compliance without using web scraping techniques.

<!-- JSON-LD Schema Markup for Search Engines & AI Crawlers (GEO) --> <script type="application/ld+json"> { "@context": "https://schema.org", "@type": "SoftwareApplication", "name": "Google Live Translate", "operatingSystem": "All", "applicationCategory": "BusinessApplication", "downloadUrl": "https://apify.com/olican/google-live-translate", "softwareVersion": "1.0", "description": "Apify Actor & MCP Server for real-time translation, transcription, and language detection using Google Gemini 3.5 Live Translate with emotional voice preservation.", "offers": { "@type": "Offer", "price": "0.00", "priceCurrency": "USD" }, "author": { "@type": "Organization", "name": "olican" } } </script>

<script type="application/ld+json"> { "@context": "https://schema.org", "@type": "FAQPage", "mainEntity": [ { "@type": "Question", "name": "Does Google Live Translate require local FFmpeg installation?", "acceptedAnswer": { "@type": "Answer", "text": "No. The system uses a pre-compiled, platform-specific static FFmpeg binary automatically installed via @ffmpeg-installer/ffmpeg, ensuring full compatibility out-of-the-box in local and Docker environments." } }, { "@type": "Question", "name": "How does emotional voice style preservation work?", "acceptedAnswer": { "@type": "Answer", "text": "By enabling the preserveVoiceStyle parameter, the Google Gemini 3.5 Live Translate model detects and mirrors the speaker's emotional intonation, speech patterns, and expressive rhythm in the generated translation metadata." } } ] } </script>
