VoiceLayer

VoiceLayer MCP server enables AI coding assistants to speak and hear via local, on-device speech-to-text and text-to-speech, with no cloud dependencies.

Your AI agent can't hear you. VoiceLayer gives it ears and a voice.

Voice I/O for AI coding assistants. Press F5, speak to Claude Code, get on-device transcription in under 1.5 seconds. Your AI speaks back. Works with any MCP client.

  You ──🎤──> whisper.cpp ──> Claude Code ──> edge-tts ──🔊──> You
         STT (local)           MCP tools         TTS (free)

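Local-first. Free. Open-source. With the default whisper.cpp backend, transcription needs no cloud API or API key, and recorded audio stays on your machine; Wispr Flow is an optional cloud fallback. Part of the Golems ecosystem.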

VoiceLayer runs as a persistent singleton daemon on a Unix socket. Every Claude session connects through a lightweight socat shim instead of spawning its own process. It ships 2 canonical MCP tools plus 9 backward-compatible aliases, all with full ToolAnnotations.

Architecture

                  ┌──────────────────────────────────────┐
                  │         VoiceLayer Daemon            │
                  │     /tmp/voicelayer-mcp.sock         │
                  │                                      │
                  │  MCP JSONRPC ──> Tool Handlers       │
                  │  (Content-Length     ├── voice_speak │
                  │   framing)           └── voice_ask   │
                  │                                      │
                  │  TTS: edge-tts (retry + 30s timeout) │
                  │  STT: whisper.cpp / Wispr Flow       │
                  │  VAD: Silero ONNX (speech detection) │
                  │  IPC: Voice Bar ← NDJSON events      │
                  └──────────┬───────────────────────────┘
                             │ Unix socket
              ┌──────────────┼──────────────┐
              │              │              │
         Claude Code    Claude Code    Cursor/Codex
         (socat shim)  (socat shim)   (socat shim)
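
The diagram mentions Content-Length framing for MCP JSON-RPC messages. As a rough illustration of that framing style (not VoiceLayer's actual src/mcp-framing.ts), here is a minimal sketch. The names encodeFrame and decodeFrames are hypothetical, and the sketch assumes a header of the form `Content-Length: N` followed by a blank line.

```typescript
// Sketch of Content-Length framing for JSON-RPC messages.
// encodeFrame/decodeFrames are hypothetical names, not VoiceLayer's API.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

function encodeFrame(message: unknown): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  // Content-Length counts bytes, not characters.
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const frame = new Uint8Array(header.length + body.length);
  frame.set(header, 0);
  frame.set(body, header.length);
  return frame;
}

// Finds the start of the "\r\n\r\n" separator at or after `from`, or -1.
function indexOfSeparator(buf: Uint8Array, from: number): number {
  for (let i = from; i + 3 < buf.length; i++) {
    if (buf[i] === 13 && buf[i + 1] === 10 && buf[i + 2] === 13 && buf[i + 3] === 10) return i;
  }
  return -1;
}

// Returns every complete message in `buffer` plus the unconsumed remainder.
function decodeFrames(buffer: Uint8Array): { messages: unknown[]; rest: Uint8Array } {
  const messages: unknown[] = [];
  let offset = 0;
  while (true) {
    const headerEnd = indexOfSeparator(buffer, offset);
    if (headerEnd === -1) break; // header not fully received yet
    const header = decoder.decode(buffer.subarray(offset, headerEnd));
    const match = /Content-Length:\s*(\d+)/i.exec(header);
    if (!match) throw new Error("missing Content-Length header");
    const length = Number(match[1]);
    const bodyStart = headerEnd + 4;
    if (buffer.length < bodyStart + length) break; // body not fully received yet
    messages.push(JSON.parse(decoder.decode(buffer.subarray(bodyStart, bodyStart + length))));
    offset = bodyStart + length;
  }
  return { messages, rest: buffer.subarray(offset) };
}
```

A socket reader would append each incoming chunk to `rest` and call decodeFrames again, so partial frames are simply carried over to the next read.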

Why a daemon? The original design spawned a new Bun process per Claude session. With 17+ repos open, that meant 17 competing processes (700+ MB RAM), fighting over one Voice Bar socket, crashing edge-tts with PATH issues, and leaving orphans that never died. The daemon architecture — shipped in PRs #67-72 — replaced all of that with a single process and socat shims.

| Metric | Before (spawn-per-session) | After (daemon) |
| --- | --- | --- |
| Processes | N per session (17+ typical) | 1 daemon + socat shims |
| RAM | ~700 MB (17 x 41 MB) | ~50 MB |
| Orphan cleanup | Manual pkill | PID lockfile auto-kills stale |
| edge-tts failures | Random (PATH, contention) | Retry with 30s hard timeout |
| voice_ask hang | Up to 300s (5 min!) | 30s default + outer guard |
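
To make the "PID lockfile auto-kills stale" row concrete, here is a small sketch of the startup decision, assuming the lockfile stores a single PID. planStartup and its parameters are hypothetical names; the liveness probe is injected so the logic can be tested without touching real processes.

```typescript
// Sketch of a PID-lockfile startup decision. All names are hypothetical.
type StartupPlan = {
  kill: number | null; // PID of a stale daemon to terminate first, if any
  writePid: number;    // PID to record in the lockfile afterwards
};

function planStartup(
  lockedPid: number | null,          // PID read from the lockfile, or null if absent
  selfPid: number,
  isAlive: (pid: number) => boolean, // e.g. a wrapper around process.kill(pid, 0)
): StartupPlan {
  const stale = lockedPid !== null && lockedPid !== selfPid && isAlive(lockedPid);
  return { kill: stale ? lockedPid : null, writePid: selfPid };
}
```

A lockfile that points at a dead PID needs no kill; the new daemon just overwrites it.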

Quick Start

# Install from npm
bun add -g voicelayer-mcp

# Prerequisites
brew install sox socat
pip3 install edge-tts
brew install whisper-cpp  # optional — local STT

# Download a whisper model (recommended)
mkdir -p ~/.cache/whisper
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

Or install from source:

git clone https://github.com/EtanHey/voicelayer.git
cd voicelayer && bun install

Start the Daemon

# Option A: LaunchAgent (auto-start on login, auto-restart on crash)
./launchd/install.sh

# Option B: Manual
bun run src/mcp-server-daemon.ts

Disabling VoiceLayer

DISABLE_VOICELAYER=1 is a hard kill-switch for the MCP daemon.

# Install the LaunchAgent in a disabled state and sync the runtime daemon flag
DISABLE_VOICELAYER=1 ./launchd/install.sh

# Or edit the template-generated plist and add:
# <key>DISABLE_VOICELAYER</key>
# <string>1</string>

If the daemon is already running, create /tmp/.voicelayer-daemon-disabled and it will shut down within 5 seconds. ./launchd/install.sh also keeps that file in sync with DISABLE_VOICELAYER, so VoiceBar-launched daemons stay disabled too. To re-enable it, remove the env var from ~/Library/LaunchAgents/com.voicelayer.mcp-daemon.plist, delete /tmp/.voicelayer-daemon-disabled if present, and restart the agent:

launchctl kickstart -k "gui/$(id -u)/com.voicelayer.mcp-daemon"
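
The kill-switch described above combines an environment variable with a flag file. A minimal sketch of that check might look like the following. isDisabled and watchKillSwitch are hypothetical names, and the 1-second poll interval is an assumption chosen only to stay inside the stated 5-second shutdown window.

```typescript
// Sketch of the kill-switch check. Only the env var name and the flag path
// come from the README; everything else is illustrative.
const DISABLE_FLAG_PATH = "/tmp/.voicelayer-daemon-disabled";

function isDisabled(
  env: Record<string, string | undefined>,
  fileExists: (path: string) => boolean, // e.g. fs.existsSync
): boolean {
  return env.DISABLE_VOICELAYER === "1" || fileExists(DISABLE_FLAG_PATH);
}

// Polls `check` and calls `shutdown` once it returns true.
// Returns a function that cancels the polling.
function watchKillSwitch(check: () => boolean, shutdown: () => void, intervalMs = 1000): () => void {
  const timer = setInterval(() => {
    if (check()) {
      clearInterval(timer);
      shutdown();
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```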

Configure MCP Clients

Add to your .mcp.json (in any repo where you use Claude Code):

{
  "mcpServers": {
    "voicelayer": {
      "command": "socat",
      "args": ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"]
    }
  }
}

Or migrate all repos at once:

bash scripts/migrate-to-daemon.sh         # migrates every .mcp.json under ~/Gits
bash scripts/migrate-to-daemon.sh --dry-run  # preview without changes
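
The migration script itself is bash, but the rewrite it performs on each .mcp.json can be pictured with a short sketch. migrateToDaemon is a hypothetical function, and the assumption that only an existing voicelayer entry is rewritten may not match the real script.

```typescript
// Sketch of the .mcp.json rewrite. Not the actual migrate-to-daemon.sh logic.
type McpServer = { command: string; args?: string[]; env?: Record<string, string> };
type McpConfig = { mcpServers?: Record<string, McpServer> };

const SOCKET_PATH = "/tmp/voicelayer-mcp.sock";

function migrateToDaemon(config: McpConfig): McpConfig {
  const servers = config.mcpServers;
  if (!servers || !Object.prototype.hasOwnProperty.call(servers, "voicelayer")) {
    return config; // nothing to migrate
  }
  return {
    ...config,
    mcpServers: {
      ...servers,
      // Replace the spawn-per-session command with the socat shim.
      voicelayer: { command: "socat", args: ["STDIO", `UNIX-CONNECT:${SOCKET_PATH}`] },
    },
  };
}
```

The function returns a new object rather than mutating its input, which makes a dry-run mode a matter of comparing the two and skipping the write.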

Grant microphone access to your terminal (macOS: System Settings > Privacy > Microphone).

Voice Tools

Primary tools

| Tool | Behavior | Blocking | readOnly | destructive | idempotent |
| --- | --- | --- | --- | --- | --- |
| voice_speak | TTS with auto-mode (announce/brief/consult/think), replay, toggle | No | false | false | true |
| voice_ask | Speak question + record mic + transcribe response | Yes | false | false | false |

Backward-compatible aliases

| Alias | Maps to | idempotent |
| --- | --- | --- |
| qa_voice_announce | voice_speak(mode='announce') | true |
| qa_voice_brief | voice_speak(mode='brief') | true |
| qa_voice_consult | voice_speak(mode='consult') | true |
| qa_voice_say | voice_speak(mode='announce') | true |
| qa_voice_think | voice_speak(mode='think') | false |
| qa_voice_replay | voice_speak(replay_index=N) | true |
| qa_voice_toggle | voice_speak(enabled=bool) | true |
| qa_voice_converse | voice_ask | false |
| qa_voice_ask | voice_ask | false |

All 11 tools include MCP ToolAnnotations. No VoiceLayer tools are destructive. All have openWorldHint: false.
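
One way to picture the alias layer is a lookup table that resolves each legacy name to a canonical tool plus preset arguments. The sketch below is built from the tables above; resolveTool and the table shapes are hypothetical, and the annotation field names follow the MCP ToolAnnotations hints.

```typescript
// Sketch only: not VoiceLayer's handler code.
type Canonical = "voice_speak" | "voice_ask";
type Args = Record<string, unknown>;
type Annotations = {
  readOnlyHint: boolean;
  destructiveHint: boolean;
  idempotentHint: boolean;
  openWorldHint: boolean;
};

const ANNOTATIONS: Record<Canonical, Annotations> = {
  voice_speak: { readOnlyHint: false, destructiveHint: false, idempotentHint: true, openWorldHint: false },
  voice_ask: { readOnlyHint: false, destructiveHint: false, idempotentHint: false, openWorldHint: false },
};

const ALIASES = new Map<string, { tool: Canonical; preset: Args }>([
  ["qa_voice_announce", { tool: "voice_speak", preset: { mode: "announce" } }],
  ["qa_voice_brief", { tool: "voice_speak", preset: { mode: "brief" } }],
  ["qa_voice_consult", { tool: "voice_speak", preset: { mode: "consult" } }],
  ["qa_voice_say", { tool: "voice_speak", preset: { mode: "announce" } }],
  ["qa_voice_think", { tool: "voice_speak", preset: { mode: "think" } }],
  ["qa_voice_replay", { tool: "voice_speak", preset: {} }], // caller supplies replay_index
  ["qa_voice_toggle", { tool: "voice_speak", preset: {} }], // caller supplies enabled
  ["qa_voice_converse", { tool: "voice_ask", preset: {} }],
  ["qa_voice_ask", { tool: "voice_ask", preset: {} }],
]);

function resolveTool(name: string, args: Args = {}): { tool: Canonical; args: Args } {
  if (name === "voice_speak" || name === "voice_ask") return { tool: name, args };
  const alias = ALIASES.get(name);
  if (alias === undefined) throw new Error(`unknown tool: ${name}`);
  // Preset values win so an alias always behaves as its name promises.
  return { tool: alias.tool, args: { ...args, ...alias.preset } };
}
```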

How voice_ask Works

  1. Waits for any playing voice_speak audio to finish
  2. Speaks the question via edge-tts (with retry on failure)
  3. Records mic at device native rate, resamples to 16kHz
  4. Silero VAD detects speech onset and silence end
  5. whisper.cpp transcribes locally (~200-400ms on Apple Silicon)
  6. Returns transcription to the AI agent
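
Step 3 resamples the microphone signal to 16 kHz. The README does not say how, so the following is only a sketch of one simple approach, linear interpolation; resampleLinear is a hypothetical name, and the real pipeline may rely on sox or another resampler instead.

```typescript
// Sketch of linear-interpolation resampling to 16 kHz.
// Not VoiceLayer's implementation; shown only to illustrate the step.
function resampleLinear(input: Float32Array, fromRate: number, toRate = 16000): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

Plain interpolation applies no low-pass filter, so downsampling this way can alias high frequencies; a production resampler would filter first.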

Reliability Features

  • PID lockfile (/tmp/voicelayer-mcp.pid): On startup, detects and kills any orphan MCP server from a previous session
  • edge-tts retry: Health check (cached 60s) + automatic retry with 30s hard timeout per attempt
  • Outer timeout guard: Promise.race wrapper around the entire voice_ask flow — if anything hangs, returns an error instead of blocking forever
  • Session booking: Lockfile mutex prevents mic conflicts between concurrent sessions
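
The outer timeout guard is described as a Promise.race wrapper. A minimal sketch of that pattern follows; withTimeout is a hypothetical name and the error message format is an assumption.

```typescript
// Sketch of a Promise.race timeout guard. withTimeout is a hypothetical name.
function withTimeout<T>(work: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way so it
  // cannot keep the process alive after the work finishes.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

Note that the race only stops waiting; it does not cancel the underlying work, so a real guard would also need to tear down the recorder or TTS process.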

Recording Controls

| Method | How |
| --- | --- |
| Stop signal | touch ~/.local/state/voicelayer/stop-{token} |
| VAD silence | Configurable: quick (0.5s), standard (1.5s), thoughtful (2.5s) |
| Timeout | 30s default, configurable 5-3600s per call |
| Push-to-talk | press_to_talk: true (no VAD, stop on signal only) |
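
The VAD silence modes can be read as "stop after this much continuous silence once speech has started". A sketch of that rule over per-frame speech probabilities is shown below. findStopFrame, the 0.5 threshold, and the frame duration parameter are all assumptions; only the three mode durations come from the table.

```typescript
// Sketch of silence-based stop detection. Names and threshold are hypothetical.
type SilenceMode = "quick" | "standard" | "thoughtful";

const SILENCE_SECONDS: Record<SilenceMode, number> = {
  quick: 0.5,
  standard: 1.5,
  thoughtful: 2.5,
};

// Returns the index of the frame at which recording should stop,
// or -1 if the silence window has not elapsed yet.
function findStopFrame(
  speechProbs: number[], // one VAD probability per audio frame
  frameMs: number,       // duration of each frame in milliseconds
  mode: SilenceMode,
  threshold = 0.5,
): number {
  const framesNeeded = Math.ceil((SILENCE_SECONDS[mode] * 1000) / frameMs);
  let heardSpeech = false;
  let silentRun = 0;
  for (let i = 0; i < speechProbs.length; i++) {
    if (speechProbs[i] >= threshold) {
      heardSpeech = true;
      silentRun = 0;
    } else if (heardSpeech) {
      silentRun += 1;
      if (silentRun >= framesNeeded) return i;
    }
  }
  return -1;
}
```

Silence before the first speech frame is ignored, so a slow start does not end the recording early.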

STT Backends

| Backend | Type | Latency | Setup |
| --- | --- | --- | --- |
| whisper.cpp | Local (default) | ~200-400ms | brew install whisper-cpp + model download |
| Wispr Flow | Cloud (fallback) | ~500ms + network | Set QA_VOICE_WISPR_KEY env var |

Auto-detected. Override with QA_VOICE_STT_BACKEND=whisper|wispr|auto.
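
The exact auto-detection rules are not spelled out here. The sketch below assumes the order implied by the table: prefer local whisper.cpp when it is usable, otherwise fall back to Wispr Flow if a key is set. resolveSttBackend and the whisperAvailable flag are hypothetical.

```typescript
// Sketch of backend selection under the assumptions stated above.
type SttBackend = "whisper" | "wispr";

function resolveSttBackend(
  env: Record<string, string | undefined>,
  whisperAvailable: boolean, // e.g. binary and model file both found
): SttBackend {
  const requested = env.QA_VOICE_STT_BACKEND ?? "auto";
  const hasWisprKey = Boolean(env.QA_VOICE_WISPR_KEY);

  if (requested === "whisper") {
    if (!whisperAvailable) throw new Error("whisper requested but not available");
    return "whisper";
  }
  if (requested === "wispr") {
    if (!hasWisprKey) throw new Error("wispr requested but QA_VOICE_WISPR_KEY is not set");
    return "wispr";
  }
  if (requested !== "auto") throw new Error(`unknown STT backend: ${requested}`);

  if (whisperAvailable) return "whisper";
  if (hasWisprKey) return "wispr";
  throw new Error("no STT backend available");
}
```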

Voice Bar (macOS)

Floating SwiftUI widget providing visual feedback during voice interactions. Connects to the daemon via NDJSON over /tmp/voicelayer.sock.

  • Teleprompter with word-level highlighting and auto-scroll
  • Waveform visualization during recording
  • Expandable pill UI — collapses to dot after 5s idle
  • Draggable, position persisted across launches
  • Global hotkey: F5 (hold for push-to-talk)

bun add -g voicelayer-mcp
voicelayer hotkey install       # Install F5/Dictation -> F18 relay
voicelayer bar                  # Build and launch Voice Bar

Hotkey Notes:

  • Requires Input Monitoring permission (System Settings > Privacy & Security)
  • On keyboards where the physical key is Apple's Dictation key, voicelayer hotkey install installs a hidutil LaunchAgent to map F5/Dictation to VoiceBar's internal F18 relay.
  • The installer preserves non-VoiceBar hidutil mappings and is safe to rerun. Shift+F5 re-pastes the latest transcript.
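
Voice Bar receives NDJSON events from the daemon over its socket. Because socket reads can split a line anywhere, a receiver has to buffer partial lines. The sketch below shows that buffering; NdjsonDecoder is a hypothetical class, and the event fields used in the usage example are made up for illustration.

```typescript
// Sketch of an NDJSON stream decoder (one JSON value per line).
class NdjsonDecoder {
  private buffer = "";

  // Feed a chunk of text; returns every complete event it contained.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on \n).
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }
}

// Usage with hypothetical event shapes:
const ndjson = new NdjsonDecoder();
const firstBatch = ndjson.push('{"type":"state","value":"rec');          // no complete line yet
const secondBatch = ndjson.push('ording"}\n{"type":"level","rms":0.2}\n'); // two complete events
```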

Advanced: Voice Cloning

Three-tier TTS engine cascade for cloned voices, with edge-tts as the always-available fallback:

  1. XTTS-v2 fine-tuned (cadence + timbre)
  2. F5-TTS MLX zero-shot (local, no daemon)
  3. Qwen3-TTS daemon (HTTP-based)
  4. edge-tts fallback (always available)

voicelayer extract <youtube-url>   # Extract voice samples
voicelayer clone <name>            # Build voice profile
voicelayer daemon --port 8880      # Run Qwen3-TTS server

The Qwen3 daemon now uses bearer auth from ~/.voicelayer/daemon.secret (created on first launch with mode 0600). The TypeScript bridge reads the same file automatically. Override the location with VOICELAYER_TTS_DAEMON_SECRET_FILE, VOICELAYER_TTS_AUTH_TOKEN_FILE, or voicelayer daemon --daemon-secret-file ... if you need a custom launcher path. The daemon only accepts Host: 127.0.0.1:8880 / Host: localhost:8880, rejects non-local Origin headers, and only reads reference_wav files that resolve under ~/.voicelayer/voices/.
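
The Host and Origin rules above can be sketched as a single request filter. isLocalRequest is a hypothetical function, header names are assumed to be lowercased already, and the bearer-token check is deliberately left out of this sketch.

```typescript
// Sketch of the local-only request filter described above.
const ALLOWED_HOSTS = new Set(["127.0.0.1:8880", "localhost:8880"]);

function isLocalRequest(headers: Record<string, string | undefined>): boolean {
  const host = headers["host"];
  if (host === undefined || !ALLOWED_HOSTS.has(host.toLowerCase())) return false;

  const origin = headers["origin"];
  if (origin === undefined) return true; // non-browser clients usually send no Origin

  try {
    const { hostname } = new URL(origin);
    return hostname === "127.0.0.1" || hostname === "localhost";
  } catch (_err) {
    return false; // unparsable Origin (including the literal "null") is rejected
  }
}
```

Checking Host as well as Origin is a common defense against DNS rebinding, where a remote page resolves an attacker-controlled name to 127.0.0.1.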

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| QA_VOICE_STT_BACKEND | auto | STT backend: whisper, wispr, or auto |
| QA_VOICE_WHISPER_MODEL | auto-detected | Path to whisper.cpp GGML model |
| QA_VOICE_WISPR_KEY | -- | Wispr Flow API key (cloud fallback) |
| QA_VOICE_TTS_VOICE | en-US-JennyNeural | edge-tts voice ID |
| QA_VOICE_TTS_RATE | +0% | Base speech rate |
| VOICELAYER_TTS_DAEMON_SECRET_FILE | ~/.voicelayer/daemon.secret | Preferred override for the shared Qwen3 daemon bearer secret file |
| VOICELAYER_TTS_AUTH_TOKEN_FILE | ~/.voicelayer/daemon.secret | Backward-compatible override for the shared Qwen3 daemon bearer secret file |
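
As a quick illustration of how these defaults combine, here is a sketch of a config loader. loadVoiceConfig and the result field names are hypothetical, and letting VOICELAYER_TTS_DAEMON_SECRET_FILE win when both secret-file variables are set is an assumption based on it being labeled "preferred".

```typescript
// Sketch of reading the variables above with their documented defaults.
type VoiceConfig = {
  sttBackend: string;
  whisperModel: string | undefined; // undefined means auto-detect
  ttsVoice: string;
  ttsRate: string;
  daemonSecretFile: string;
};

function loadVoiceConfig(env: Record<string, string | undefined>): VoiceConfig {
  return {
    sttBackend: env.QA_VOICE_STT_BACKEND ?? "auto",
    whisperModel: env.QA_VOICE_WHISPER_MODEL,
    ttsVoice: env.QA_VOICE_TTS_VOICE ?? "en-US-JennyNeural",
    ttsRate: env.QA_VOICE_TTS_RATE ?? "+0%",
    daemonSecretFile:
      env.VOICELAYER_TTS_DAEMON_SECRET_FILE ??
      env.VOICELAYER_TTS_AUTH_TOKEN_FILE ??
      "~/.voicelayer/daemon.secret",
  };
}
```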

Testing

bun test                              # 585 Bun tests + 1 skip (latest verified on PR #190 pre-push gate)
bash flow-bar/run_tests.sh            # 144 Swift tests for VoiceBar
git config core.hooksPath .githooks   # install repo pre-push hook once per clone (#181, #182)

Test coverage includes: MCP protocol framing, tool handlers, TTS synthesis + retry, VAD speech detection, session booking, process lock lifecycle, socket client reconnection, edge-tts health checks, schema validation, Hebrew STT eval baselines, daemon resilience, ToolAnnotations, SSML sanitization, and secure path hardening.

Recent Hardening (2026-04-27 → 2026-05-02)

One-week sprint focused on VoiceBar reliability and a recording corpus to fight STT regressions. Every line below traces to a merged PR.

Recording reliability

  • Recording control clickability restored — F6 socket controls remained interactive while the pill animated (#188).
  • Pill bottom anchor preserved during resize so the UI doesn't drift off-screen (#187).
  • Waveform animates again on real audio input + redundant "listening" copy removed (#184).
  • Waveform dynamic range restored above the silence gate (#185).
  • Custom VoiceBar install paths supported (no more hard-coded /Applications/VoiceBar.app) (#186).
  • VoiceBar transcription preserved through the recording RMS gate so quiet speech survives (#177).
  • Stale daemon restart detection — VoiceBar transcription resumes automatically after the daemon restarts (#183).

STT quality

  • No-input STT hallucinations suppressed (#189).
  • Zero-RMS audio ingestion watchdog catches a silent mic before whisper.cpp guesses (#178).

VoiceBar dictation corpus (Phase 1) — #190

  • Every successful VoiceBar dictation is archived under ~/.local/share/voicelayer/recordings/YYYY-MM-DD/<timestamp-id>/ with audio.wav + voicelayer-transcript.txt + metadata.json (schema v1, SHA-256 over WAV bytes).
  • Atomic rename + fsync so partial writes never appear in the corpus.
  • Cancelled or empty transcriptions are skipped — only real dictations land on disk.
  • Re-paste hotkey moved to Shift+F5; plain F5 is now the default record-start/stop activation through VoiceBar's F18 relay.
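
The "atomic rename + fsync" guarantee can be illustrated with a write-to-temp-then-rename sketch. This is not the project's archiver: archiveDictation, the `.tmp-` directory prefix, and the metadata field names (schema, sha256) are assumptions; only the three file names and the SHA-256 over the WAV bytes come from the notes above.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Writes a file and forces it to disk before returning.
function writeFileDurably(filePath: string, data: Uint8Array | string): void {
  const fd = fs.openSync(filePath, "w");
  try {
    fs.writeFileSync(fd, data);
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
}

// Builds the entry in a temporary directory, then renames it into place so
// readers never observe a half-written recording.
function archiveDictation(dayDir: string, id: string, wav: Uint8Array, transcript: string): string {
  const finalDir = path.join(dayDir, id);
  const tmpDir = path.join(dayDir, `.tmp-${id}`);
  fs.mkdirSync(tmpDir, { recursive: true });

  writeFileDurably(path.join(tmpDir, "audio.wav"), wav);
  writeFileDurably(path.join(tmpDir, "voicelayer-transcript.txt"), transcript);
  const metadata = { schema: 1, sha256: createHash("sha256").update(wav).digest("hex") };
  writeFileDurably(path.join(tmpDir, "metadata.json"), JSON.stringify(metadata, null, 2));

  fs.renameSync(tmpDir, finalDir); // atomic when both paths are on the same filesystem
  return finalDir;
}
```

A stricter version would also fsync the parent directory after the rename so the directory entry itself survives a crash.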

Test infrastructure

  • VoiceLayer pre-push regression gate (#181) plus exit-0 fix on the success path (#182).
  • voicelayer run_tests.sh orchestrator script unifies Bun + Swift + daemon-boot + Karabiner smoke runs (#180).
  • VoiceBar audio fixtures for golden-path STT regressions (#179).

Project Structure

voicelayer/
├── src/                          # TypeScript/Bun (18K lines, 69 files)
│   ├── mcp-server-daemon.ts      # Singleton daemon entry point
│   ├── mcp-server.ts             # Stdio MCP server (legacy)
│   ├── mcp-daemon.ts             # Unix socket server (dual-protocol)
│   ├── mcp-framing.ts            # Content-Length + NDJSON framing
│   ├── mcp-handler.ts            # JSONRPC request router
│   ├── process-lock.ts           # PID lockfile (orphan prevention)
│   ├── handlers.ts               # Tool handler implementations
│   ├── tts.ts                    # Multi-engine TTS with playback queue
│   ├── tts-health.ts             # edge-tts health check + retry
│   ├── input.ts                  # Mic recording + STT pipeline
│   ├── vad.ts                    # Silero VAD (ONNX inference)
│   ├── stt.ts                    # STT backend abstraction
│   ├── socket-client.ts          # Voice Bar IPC (auto-reconnect)
│   ├── session-booking.ts        # Lockfile mutex
│   ├── paths.ts                  # Centralized path constants
│   └── __tests__/                # 536 tests across 48 files
├── flow-bar/                     # SwiftUI macOS app (1.9K lines, 9 files)
│   ├── Sources/VoiceBar/         # App source
│   └── Tests/                    # Swift tests
├── scripts/
│   ├── migrate-to-daemon.sh      # Batch .mcp.json migration
│   └── edge-tts-words.py         # Word-level TTS with timestamps
├── launchd/                      # macOS LaunchAgent auto-start
├── models/                       # Silero VAD ONNX model
└── package.json                  # v2.0.0

Platform Support

| Platform | TTS | STT | Recording | Voice Bar |
| --- | --- | --- | --- | --- |
| macOS | edge-tts + afplay | whisper.cpp (CoreML) | sox | SwiftUI app |
| Linux | edge-tts + mpv/ffplay | whisper.cpp | sox | -- |

Part of Golems

VoiceLayer is one of three open-source MCP servers in the Golems ecosystem:

| Server | What it does | Tools |
| --- | --- | --- |
| BrainLayer | Persistent memory for AI agents: knowledge graph + hybrid search | 12 |
| VoiceLayer | Voice I/O: local STT, neural TTS, F5 push-to-talk | 11 |
| cmuxLayer | Terminal orchestration: spawn panes, read screens, coordinate agents | 22 |

Pair with BrainLayer to remember voice conversations across sessions.

License

Apache-2.0
