VoiceLayer
VoiceLayer MCP server enables AI coding assistants to speak and hear via local, on-device speech-to-text and text-to-speech, with no cloud dependencies.
Your AI agent can't hear you. VoiceLayer gives it ears and a voice.
Voice I/O for AI coding assistants. Press F5, speak to Claude Code, get on-device transcription in under 1.5 seconds. Your AI speaks back. Works with any MCP client.
You ──🎤──> whisper.cpp ──> Claude Code ──> edge-tts ──🔊──> You
            STT (local)     MCP tools       TTS (free)
Local-first. Free. Open-source. No cloud APIs, no API keys, no data leaves your machine. Part of the Golems ecosystem.
VoiceLayer runs as a persistent singleton daemon on a Unix socket: every Claude session connects through a lightweight socat shim instead of spawning its own process. Two canonical MCP tools plus nine backward-compatible aliases ship with full ToolAnnotations.
Architecture
┌───────────────────────────────────────┐
│           VoiceLayer Daemon           │
│       /tmp/voicelayer-mcp.sock        │
│                                       │
│  MCP JSONRPC ──> Tool Handlers        │
│  (Content-Length  ├── voice_speak     │
│   framing)        └── voice_ask       │
│                                       │
│  TTS: edge-tts (retry + 30s timeout)  │
│  STT: whisper.cpp / Wispr Flow        │
│  VAD: Silero ONNX (speech detection)  │
│  IPC: Voice Bar ↔ NDJSON events       │
└───────────────────┬───────────────────┘
                    │ Unix socket
     ┌──────────────┼──────────────┐
     │              │              │
Claude Code    Claude Code    Cursor/Codex
(socat shim)   (socat shim)   (socat shim)
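The MCP JSONRPC channel in the diagram uses Content-Length framing. As a rough illustration of that framing style, here is a minimal sketch; the helper names `encodeFrame` and `decodeFrames` are hypothetical and are not taken from src/mcp-framing.ts.

```typescript
// Hypothetical helpers illustrating Content-Length framing; not the code in src/mcp-framing.ts.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

/** Wrap a JSON-RPC message in a Content-Length header (the length counts bytes, not characters). */
function encodeFrame(message: unknown): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const frame = new Uint8Array(header.length + body.length);
  frame.set(header, 0);
  frame.set(body, header.length);
  return frame;
}

/** Find the index of the first "\r\n\r\n" at or after `from`, or -1 if absent. */
function indexOfHeaderEnd(buffer: Uint8Array, from: number): number {
  for (let i = from; i + 3 < buffer.length; i++) {
    if (buffer[i] === 13 && buffer[i + 1] === 10 && buffer[i + 2] === 13 && buffer[i + 3] === 10) {
      return i;
    }
  }
  return -1;
}

/** Extract every complete frame from a buffer; returns parsed messages plus the unconsumed bytes. */
function decodeFrames(buffer: Uint8Array): { messages: unknown[]; rest: Uint8Array } {
  const messages: unknown[] = [];
  let offset = 0;
  while (true) {
    const headerEnd = indexOfHeaderEnd(buffer, offset);
    if (headerEnd === -1) break; // header not fully received yet
    const header = decoder.decode(buffer.subarray(offset, headerEnd));
    const match = /Content-Length:\s*(\d+)/i.exec(header);
    if (!match) throw new Error("Missing Content-Length header");
    const bodyStart = headerEnd + 4;
    const bodyEnd = bodyStart + Number(match[1]);
    if (bodyEnd > buffer.length) break; // body not fully received yet
    messages.push(JSON.parse(decoder.decode(buffer.subarray(bodyStart, bodyEnd))));
    offset = bodyEnd;
  }
  return { messages, rest: buffer.subarray(offset) };
}
```

A socket reader would append each incoming chunk to `rest` and call `decodeFrames` again, so a message split across chunks is only parsed once it is complete.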
Why a daemon? The original design spawned a new Bun process per Claude session. With 17+ repos open, that meant 17 competing processes (700+ MB RAM), fighting over one Voice Bar socket, crashing edge-tts with PATH issues, and leaving orphans that never died. The daemon architecture, shipped in PRs #67-72, replaced all of that with a single process and socat shims.
| Metric | Before (spawn-per-session) | After (daemon) |
|---|---|---|
| Processes | N per session (17+ typical) | 1 daemon + socat shims |
| RAM | ~700 MB (17 x 41 MB) | ~50 MB |
| Orphan cleanup | Manual pkill | PID lockfile auto-kills stale |
| edge-tts failures | Random (PATH, contention) | Retry with 30s hard timeout |
| voice_ask hang | Up to 300s (5 min!) | 30s default + outer guard |
Quick Start
# Install from npm
bun add -g voicelayer-mcp
# Prerequisites
brew install sox socat
pip3 install edge-tts
brew install whisper-cpp # optional: local STT
# Download a whisper model (recommended)
mkdir -p ~/.cache/whisper
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
Or install from source:
git clone https://github.com/EtanHey/voicelayer.git
cd voicelayer && bun install
Start the Daemon
# Option A: LaunchAgent (auto-start on login, auto-restart on crash)
./launchd/install.sh
# Option B: Manual
bun run src/mcp-server-daemon.ts
Disabling VoiceLayer
DISABLE_VOICELAYER=1 is a hard kill-switch for the MCP daemon.
# Install the LaunchAgent in a disabled state and sync the runtime daemon flag
DISABLE_VOICELAYER=1 ./launchd/install.sh
# Or edit the template-generated plist and add:
# <key>DISABLE_VOICELAYER</key>
# <string>1</string>
If the daemon is already running, create /tmp/.voicelayer-daemon-disabled and it will shut down within 5 seconds. ./launchd/install.sh also keeps that file in sync with DISABLE_VOICELAYER, so VoiceBar-launched daemons stay disabled too. To re-enable it, remove the env var from ~/Library/LaunchAgents/com.voicelayer.mcp-daemon.plist, delete /tmp/.voicelayer-daemon-disabled if present, and restart the agent:
launchctl kickstart -k "gui/$(id -u)/com.voicelayer.mcp-daemon"
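One way a daemon could honor both kill-switch signals is to poll for them. The sketch below is an assumption about the mechanism, not the daemon's actual code: the function names are hypothetical, and treating only the value "1" as "disabled" is a guess based on the examples above.

```typescript
import { existsSync } from "node:fs";

// Path taken from the text above; the polling approach itself is an assumption.
const DISABLE_FLAG_FILE = "/tmp/.voicelayer-daemon-disabled";

/** True when either kill-switch signal (env var or flag file) is present. */
function isDisabled(
  env: Record<string, string | undefined>,
  flagFileExists: (path: string) => boolean = existsSync,
): boolean {
  return env["DISABLE_VOICELAYER"] === "1" || flagFileExists(DISABLE_FLAG_FILE);
}

/** Poll for the kill switch and call `shutdown` once it appears. Returns a function that stops polling. */
function watchKillSwitch(
  env: Record<string, string | undefined>,
  shutdown: () => void,
  intervalMs = 1000,
): () => void {
  const timer = setInterval(() => {
    if (isDisabled(env)) {
      clearInterval(timer);
      shutdown();
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Any polling interval under five seconds would be consistent with the "within 5 seconds" behavior described above; the one-second default here is arbitrary.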
Configure MCP Clients
Add to your .mcp.json (in any repo where you use Claude Code):
{
"mcpServers": {
"voicelayer": {
"command": "socat",
"args": ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"]
}
}
}
Or migrate all repos at once:
bash scripts/migrate-to-daemon.sh # migrates every .mcp.json under ~/Gits
bash scripts/migrate-to-daemon.sh --dry-run # preview without changes
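For a sense of what migrating a single .mcp.json involves, here is a small sketch that swaps a spawn-per-session entry for the socat shim. It is illustrative only: `toDaemonShim` is a hypothetical name, the legacy entry shape is assumed, and scripts/migrate-to-daemon.sh may work differently.

```typescript
// Hypothetical sketch; the real migration lives in scripts/migrate-to-daemon.sh.
interface McpServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

interface McpConfig {
  mcpServers?: Record<string, McpServerEntry>;
}

/** Return a copy of the config whose VoiceLayer entry points at the daemon socket. */
function toDaemonShim(
  config: McpConfig,
  serverName = "voicelayer",
): { config: McpConfig; changed: boolean } {
  const current = config.mcpServers?.[serverName];
  // Nothing to do if the repo does not use VoiceLayer or already uses the shim.
  if (!current || current.command === "socat") return { config, changed: false };
  const shim: McpServerEntry = {
    command: "socat",
    args: ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"],
  };
  return {
    config: { ...config, mcpServers: { ...config.mcpServers, [serverName]: shim } },
    changed: true,
  };
}
```

Returning a `changed` flag makes a dry-run mode straightforward: report the files that would change and skip the write.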
Grant microphone access to your terminal (macOS: System Settings > Privacy & Security > Microphone).
Voice Tools
Primary tools
| Tool | Behavior | Blocking | readOnly | destructive | idempotent |
|---|---|---|---|---|---|
| voice_speak | TTS with auto-mode (announce/brief/consult/think), replay, toggle | No | false | false | true |
| voice_ask | Speak question + record mic + transcribe response | Yes | false | false | false |
Backward-compatible aliases
| Alias | Maps to | idempotent |
|---|---|---|
| qa_voice_announce | voice_speak(mode='announce') | true |
| qa_voice_brief | voice_speak(mode='brief') | true |
| qa_voice_consult | voice_speak(mode='consult') | true |
| qa_voice_say | voice_speak(mode='announce') | true |
| qa_voice_think | voice_speak(mode='think') | false |
| qa_voice_replay | voice_speak(replay_index=N) | true |
| qa_voice_toggle | voice_speak(enabled=bool) | true |
| qa_voice_converse | voice_ask | false |
| qa_voice_ask | voice_ask | false |
All 11 tools include MCP ToolAnnotations. No VoiceLayer tools are destructive. All have openWorldHint: false.
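The tables above can be expressed as data. This sketch uses the MCP ToolAnnotations hint fields (`readOnlyHint`, `destructiveHint`, `idempotentHint`, `openWorldHint`) with the values listed for the two primary tools, plus a hypothetical `resolveTool` helper that shows how an alias might be rewritten to its primary tool. The helper and the way caller arguments are merged are assumptions, not the project's handler code.

```typescript
interface ToolAnnotations {
  readOnlyHint: boolean;
  destructiveHint: boolean;
  idempotentHint: boolean;
  openWorldHint: boolean;
}

type PrimaryTool = "voice_speak" | "voice_ask";

// Values copied from the "Primary tools" table.
const PRIMARY_ANNOTATIONS: Record<PrimaryTool, ToolAnnotations> = {
  voice_speak: { readOnlyHint: false, destructiveHint: false, idempotentHint: true, openWorldHint: false },
  voice_ask: { readOnlyHint: false, destructiveHint: false, idempotentHint: false, openWorldHint: false },
};

// Alias -> primary tool plus the arguments the alias pins (from the alias table).
const ALIASES = new Map<string, { tool: PrimaryTool; defaults: Record<string, unknown> }>([
  ["qa_voice_announce", { tool: "voice_speak", defaults: { mode: "announce" } }],
  ["qa_voice_brief", { tool: "voice_speak", defaults: { mode: "brief" } }],
  ["qa_voice_consult", { tool: "voice_speak", defaults: { mode: "consult" } }],
  ["qa_voice_say", { tool: "voice_speak", defaults: { mode: "announce" } }],
  ["qa_voice_think", { tool: "voice_speak", defaults: { mode: "think" } }],
  ["qa_voice_replay", { tool: "voice_speak", defaults: {} }], // caller supplies replay_index
  ["qa_voice_toggle", { tool: "voice_speak", defaults: {} }], // caller supplies enabled
  ["qa_voice_converse", { tool: "voice_ask", defaults: {} }],
  ["qa_voice_ask", { tool: "voice_ask", defaults: {} }],
]);

/** Hypothetical resolver: map any of the 11 tool names to a primary tool and merged arguments. */
function resolveTool(
  name: string,
  args: Record<string, unknown> = {},
): { tool: PrimaryTool; args: Record<string, unknown> } | undefined {
  if (name === "voice_speak" || name === "voice_ask") return { tool: name, args };
  const alias = ALIASES.get(name);
  if (!alias) return undefined;
  return { tool: alias.tool, args: { ...args, ...alias.defaults } };
}
```

Note that an alias can carry different annotations from its target: qa_voice_think is listed as non-idempotent even though voice_speak is idempotent.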
How voice_ask Works
- Waits for any playing voice_speak audio to finish
- Speaks the question via edge-tts (with retry on failure)
- Records mic at device native rate, resamples to 16kHz
- Silero VAD detects speech onset and silence end
- whisper.cpp transcribes locally (~200-400ms on Apple Silicon)
- Returns transcription to the AI agent
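The resampling step in the list above (device native rate down to 16kHz) can be illustrated with plain linear interpolation. This is a swapped-in technique for illustration only: the source does not say which resampler VoiceLayer uses, and `resampleLinear` is a hypothetical name.

```typescript
/**
 * Resample mono PCM samples by linear interpolation.
 * Illustrative only: no low-pass filter is applied, so downsampling can alias.
 */
function resampleLinear(input: Float32Array, fromRate: number, toRate = 16000): Float32Array {
  if (fromRate <= 0 || toRate <= 0) throw new RangeError("Sample rates must be positive");
  if (fromRate === toRate || input.length === 0) return input.slice();

  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1); // clamp at the last sample
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

One second of 48kHz audio (48,000 samples) becomes 16,000 samples. A production pipeline would normally filter before decimating.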
Reliability Features
- PID lockfile (/tmp/voicelayer-mcp.pid): On startup, detects and kills any orphan MCP server from a previous session
- edge-tts retry: Health check (cached 60s) + automatic retry with 30s hard timeout per attempt
- Outer timeout guard: Promise.race wrapper around the entire voice_ask flow; if anything hangs, returns an error instead of blocking forever
- Session booking: Lockfile mutex prevents mic conflicts between concurrent sessions
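The timeout guard and per-attempt retry described above follow a common pattern. Below is a minimal sketch of that pattern, assuming hypothetical helper names (`withTimeout`, `withRetry`); it is not the project's implementation.

```typescript
class TimeoutError extends Error {}

/** Reject if `work` has not settled within `ms`. The timer is always cleared afterwards. */
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

/** Run `attempt` up to `attempts` times, each bounded by its own hard timeout. */
async function withRetry<T>(
  attempt: () => Promise<T>,
  attempts: number,
  perAttemptMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await withTimeout(attempt(), perAttemptMs, `attempt ${i}`);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

Promise.race only stops waiting; it does not cancel the underlying work. A real guard would also need to kill the child process or stop the recording that hung.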
Recording Controls
| Method | How |
|---|---|
| Stop signal | touch ~/.local/state/voicelayer/stop-{token} |
| VAD silence | Configurable: quick (0.5s), standard (1.5s), thoughtful (2.5s) |
| Timeout | 30s default, configurable 5-3600s per call |
| Push-to-talk | press_to_talk: true (no VAD, stop on signal only) |
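The controls in this table combine into a per-call recording plan. The sketch below is hypothetical: only `press_to_talk` appears in the source, while `timeout_seconds`, `silence_mode`, the "standard" default, and clamping (rather than rejecting) out-of-range timeouts are all assumptions.

```typescript
type SilenceMode = "quick" | "standard" | "thoughtful";

// Silence thresholds in seconds, from the table above.
const SILENCE_SECONDS: Record<SilenceMode, number> = {
  quick: 0.5,
  standard: 1.5,
  thoughtful: 2.5,
};

interface RecordingRequest {
  timeout_seconds?: number; // hypothetical parameter name
  silence_mode?: SilenceMode; // hypothetical parameter name
  press_to_talk?: boolean;
}

interface RecordingPlan {
  timeoutSeconds: number;
  /** Seconds of silence that end the recording, or null when VAD is off. */
  silenceSeconds: number | null;
}

function planRecording(req: RecordingRequest = {}): RecordingPlan {
  // 30s default, clamped to the documented 5-3600s range (clamping is an assumption).
  const timeoutSeconds = Math.min(3600, Math.max(5, req.timeout_seconds ?? 30));
  // Push-to-talk disables VAD, so recording stops only on the stop signal or timeout.
  const silenceSeconds = req.press_to_talk ? null : SILENCE_SECONDS[req.silence_mode ?? "standard"];
  return { timeoutSeconds, silenceSeconds };
}
```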
STT Backends
| Backend | Type | Latency | Setup |
|---|---|---|---|
| whisper.cpp | Local (default) | ~200-400ms | brew install whisper-cpp + model download |
| Wispr Flow | Cloud (fallback) | ~500ms + network | Set QA_VOICE_WISPR_KEY env var |
Auto-detected. Override with QA_VOICE_STT_BACKEND=whisper|wispr|auto.
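Backend selection might look roughly like the following. The preference order in auto mode (local whisper.cpp first, Wispr Flow as fallback) follows the table above; the function name, the `SttEnvironment` shape, and the error behavior are assumptions.

```typescript
type SttBackend = "whisper" | "wispr";

interface SttEnvironment {
  whisperAvailable: boolean; // e.g. whisper.cpp binary and model were found
  wisprKey?: string; // value of QA_VOICE_WISPR_KEY, if set
}

/** Hypothetical resolver for QA_VOICE_STT_BACKEND=whisper|wispr|auto. */
function selectSttBackend(setting: string | undefined, env: SttEnvironment): SttBackend {
  const mode = setting ?? "auto";
  if (mode === "whisper") {
    if (!env.whisperAvailable) throw new Error("whisper requested but whisper.cpp is not available");
    return "whisper";
  }
  if (mode === "wispr") {
    if (!env.wisprKey) throw new Error("wispr requested but QA_VOICE_WISPR_KEY is not set");
    return "wispr";
  }
  if (mode !== "auto") throw new Error(`Unknown QA_VOICE_STT_BACKEND: ${mode}`);
  // auto: prefer the local backend, fall back to the cloud one.
  if (env.whisperAvailable) return "whisper";
  if (env.wisprKey) return "wispr";
  throw new Error("No STT backend available");
}
```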
Voice Bar (macOS)
Floating SwiftUI widget providing visual feedback during voice interactions. Connects to the daemon via NDJSON over /tmp/voicelayer.sock.
- Teleprompter with word-level highlighting and auto-scroll
- Waveform visualization during recording
- Expandable pill UI that collapses to a dot after 5s idle
- Draggable, position persisted across launches
- Global hotkey: F5 (hold for push-to-talk)
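Voice Bar talks to the daemon in NDJSON, meaning one JSON object per line. A minimal sketch of encoding and incremental decoding follows; the class name and the event shape in the usage note are hypothetical, since the source does not document the event schema.

```typescript
/** Serialize one event as a single NDJSON line. */
function encodeNdjson(event: unknown): string {
  return JSON.stringify(event) + "\n";
}

/** Incremental NDJSON decoder: buffers partial lines across socket chunks. */
class NdjsonDecoder {
  private pending = "";

  /** Feed a chunk of text and return every event completed by it. */
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    this.pending = lines.pop() ?? ""; // last piece is an incomplete line (or empty)
    return lines.filter((line) => line.trim() !== "").map((line) => JSON.parse(line));
  }
}
```

For example, feeding `'{"type":"state"}\n{"ty'` yields one event and keeps `{"ty` buffered until the rest of the line arrives.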
bun add -g voicelayer-mcp
voicelayer hotkey install # Install F5/Dictation -> F18 relay
voicelayer bar # Build and launch Voice Bar
Hotkey Notes:
- Requires Input Monitoring permission (System Settings > Privacy & Security)
- On keyboards where the physical key is Apple's Dictation key, voicelayer hotkey install installs a hidutil LaunchAgent to map F5/Dictation to VoiceBar's internal F18 relay.
- The installer preserves non-VoiceBar hidutil mappings and is safe to rerun.
- Shift+F5 re-pastes the latest transcript.
Advanced: Voice Cloning
Three-tier TTS engine cascade for cloned voices, followed by an edge-tts fallback:
- XTTS-v2 fine-tuned (cadence + timbre)
- F5-TTS MLX zero-shot (local, no daemon)
- Qwen3-TTS daemon (HTTP-based)
- edge-tts fallback (always available)
voicelayer extract <youtube-url> # Extract voice samples
voicelayer clone <name> # Build voice profile
voicelayer daemon --port 8880 # Run Qwen3-TTS server
The Qwen3 daemon now uses bearer auth from ~/.voicelayer/daemon.secret
(created on first launch with mode 0600). The TypeScript bridge reads the
same file automatically. Override the location with
VOICELAYER_TTS_DAEMON_SECRET_FILE,
VOICELAYER_TTS_AUTH_TOKEN_FILE, or
voicelayer daemon --daemon-secret-file ... if you need a custom launcher
path. The daemon only accepts Host: 127.0.0.1:8880 /
Host: localhost:8880, rejects non-local Origin headers, and only reads
reference_wav files that resolve under ~/.voicelayer/voices/.
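The request checks described above (Host allow-list, local-only Origin, and path containment for reference_wav) can be sketched as small predicates. The function names are hypothetical, accepting requests with no Origin header is an assumption, and a real check would also need to resolve symlinks (for example with realpath) before comparing paths.

```typescript
import { resolve, sep } from "node:path";

/** Accept only the two Host values named above for the given port. */
function isAllowedHost(hostHeader: string | undefined, port = 8880): boolean {
  return hostHeader === `127.0.0.1:${port}` || hostHeader === `localhost:${port}`;
}

/** Reject non-local Origin headers. A missing Origin is allowed here (assumption). */
function isLocalOrigin(origin: string | undefined): boolean {
  if (origin === undefined) return true;
  try {
    const { hostname } = new URL(origin);
    return hostname === "127.0.0.1" || hostname === "localhost";
  } catch {
    return false; // unparseable Origin
  }
}

/** True when `candidate` resolves to a path strictly inside `baseDir` (no symlink resolution). */
function isUnderDir(candidate: string, baseDir: string): boolean {
  const base = resolve(baseDir);
  const target = resolve(base, candidate);
  return target.startsWith(base + sep);
}
```

Comparing against `base + sep` rather than `base` keeps a sibling directory such as `voices-evil` from passing as a prefix match.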
Environment Variables
| Variable | Default | Description |
|---|---|---|
| QA_VOICE_STT_BACKEND | auto | STT backend: whisper, wispr, or auto |
| QA_VOICE_WHISPER_MODEL | auto-detected | Path to whisper.cpp GGML model |
| QA_VOICE_WISPR_KEY | -- | Wispr Flow API key (cloud fallback) |
| QA_VOICE_TTS_VOICE | en-US-JennyNeural | edge-tts voice ID |
| QA_VOICE_TTS_RATE | +0% | Base speech rate |
| VOICELAYER_TTS_DAEMON_SECRET_FILE | ~/.voicelayer/daemon.secret | Preferred override for the shared Qwen3 daemon bearer secret file |
| VOICELAYER_TTS_AUTH_TOKEN_FILE | ~/.voicelayer/daemon.secret | Backward-compatible override for the shared Qwen3 daemon bearer secret file |
Testing
bun test # 585 Bun tests + 1 skip (latest verified on PR #190 pre-push gate)
bash flow-bar/run_tests.sh # 144 Swift tests for VoiceBar
git config core.hooksPath .githooks # install repo pre-push hook once per clone (#181, #182)
Test coverage includes: MCP protocol framing, tool handlers, TTS synthesis + retry, VAD speech detection, session booking, process lock lifecycle, socket client reconnection, edge-tts health checks, schema validation, Hebrew STT eval baselines, daemon resilience, ToolAnnotations, SSML sanitization, and secure path hardening.
Recent Hardening (2026-04-27 to 2026-05-02)
One-week sprint focused on VoiceBar reliability and a recording corpus to fight STT regressions. Every line below traces to a merged PR.
Recording reliability
- Recording control clickability restored: F6 socket controls remained interactive while the pill animated (#188).
- Pill bottom anchor preserved during resize so the UI doesn't drift off-screen (#187).
- Waveform animates again on real audio input + redundant "listening" copy removed (#184).
- Waveform dynamic range restored above the silence gate (#185).
- Custom VoiceBar install paths supported (no more hard-coded /Applications/VoiceBar.app) (#186).
- VoiceBar transcription preserved through the recording RMS gate so quiet speech survives (#177).
- Stale daemon restart detection: VoiceBar transcription resumes automatically after the daemon restarts (#183).
STT quality
- No-input STT hallucinations suppressed (#189).
- Zero-RMS audio ingestion watchdog catches a silent mic before whisper.cpp guesses (#178).
VoiceBar dictation corpus (Phase 1, #190)
- Every successful VoiceBar dictation is archived under ~/.local/share/voicelayer/recordings/YYYY-MM-DD/<timestamp-id>/ with audio.wav + voicelayer-transcript.txt + metadata.json (schema v1, SHA-256 over WAV bytes).
- Atomic rename + fsync so partial writes never appear in the corpus.
- Cancelled or empty transcriptions are skipped; only real dictations land on disk.
- Re-paste hotkey moved to Shift+F5; plain F5 is now the default record-start/stop activation through VoiceBar's F18 relay.
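The archive behavior listed above (skip empty transcripts, hash the WAV bytes, write with fsync and an atomic rename) could be sketched as follows. Everything beyond the three file names and "schema v1" is an assumption: the function names, the metadata field names, the per-file (rather than per-directory) rename, and the use of UTC for the date folder.

```typescript
import { createHash } from "node:crypto";
import { closeSync, fsyncSync, mkdirSync, openSync, renameSync, writeFileSync } from "node:fs";
import { join } from "node:path";

/** Write to a temp file, fsync it, then rename into place so readers never see a partial file. */
function writeFileAtomic(path: string, data: string | Uint8Array): void {
  const tmp = `${path}.tmp-${process.pid}`;
  const fd = openSync(tmp, "w");
  try {
    writeFileSync(fd, data);
    fsyncSync(fd); // flush file contents to disk before the rename
  } finally {
    closeSync(fd);
  }
  renameSync(tmp, path); // atomic on POSIX when both paths are on the same filesystem
}

/** Archive one dictation; returns the entry directory, or null when there is nothing to keep. */
function archiveDictation(
  root: string,
  id: string,
  wav: Uint8Array,
  transcript: string,
  now: Date = new Date(),
): string | null {
  if (transcript.trim() === "") return null; // cancelled or empty: skip

  const day = now.toISOString().slice(0, 10); // YYYY-MM-DD in UTC (assumption)
  const dir = join(root, day, id);
  mkdirSync(dir, { recursive: true });

  const sha256 = createHash("sha256").update(wav).digest("hex");
  writeFileAtomic(join(dir, "audio.wav"), wav);
  writeFileAtomic(join(dir, "voicelayer-transcript.txt"), transcript);
  // Field names are hypothetical; only "schema v1" and the SHA-256 come from the notes above.
  const metadata = { schema_version: 1, id, sha256 };
  writeFileAtomic(join(dir, "metadata.json"), JSON.stringify(metadata, null, 2));
  return dir;
}
```

Writing metadata.json last means a reader can treat its presence as the signal that the entry is complete.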
Test infrastructure
- VoiceLayer pre-push regression gate (#181) plus exit-0 fix on the success path (#182).
- voicelayer run_tests.sh orchestrator script unifies Bun + Swift + daemon-boot + Karabiner smoke runs (#180).
- VoiceBar audio fixtures for golden-path STT regressions (#179).
Project Structure
voicelayer/
├── src/                         # TypeScript/Bun (18K lines, 69 files)
│   ├── mcp-server-daemon.ts     # Singleton daemon entry point
│   ├── mcp-server.ts            # Stdio MCP server (legacy)
│   ├── mcp-daemon.ts            # Unix socket server (dual-protocol)
│   ├── mcp-framing.ts           # Content-Length + NDJSON framing
│   ├── mcp-handler.ts           # JSONRPC request router
│   ├── process-lock.ts          # PID lockfile (orphan prevention)
│   ├── handlers.ts              # Tool handler implementations
│   ├── tts.ts                   # Multi-engine TTS with playback queue
│   ├── tts-health.ts            # edge-tts health check + retry
│   ├── input.ts                 # Mic recording + STT pipeline
│   ├── vad.ts                   # Silero VAD (ONNX inference)
│   ├── stt.ts                   # STT backend abstraction
│   ├── socket-client.ts         # Voice Bar IPC (auto-reconnect)
│   ├── session-booking.ts       # Lockfile mutex
│   ├── paths.ts                 # Centralized path constants
│   └── __tests__/               # 536 tests across 48 files
├── flow-bar/                    # SwiftUI macOS app (1.9K lines, 9 files)
│   ├── Sources/VoiceBar/        # App source
│   └── Tests/                   # Swift tests
├── scripts/
│   ├── migrate-to-daemon.sh     # Batch .mcp.json migration
│   └── edge-tts-words.py        # Word-level TTS with timestamps
├── launchd/                     # macOS LaunchAgent auto-start
├── models/                      # Silero VAD ONNX model
└── package.json                 # v2.0.0
Platform Support
| Platform | TTS | STT | Recording | Voice Bar |
|---|---|---|---|---|
| macOS | edge-tts + afplay | whisper.cpp (CoreML) | sox | SwiftUI app |
| Linux | edge-tts + mpv/ffplay | whisper.cpp | sox | -- |
Part of Golems
VoiceLayer is one of three open-source MCP servers in the Golems ecosystem:
| Server | What it does | Tools |
|---|---|---|
| BrainLayer | Persistent memory for AI agents: knowledge graph + hybrid search | 12 |
| VoiceLayer | Voice I/O: local STT, neural TTS, F5 push-to-talk | 11 |
| cmuxLayer | Terminal orchestration: spawn panes, read screens, coordinate agents | 22 |
Pair with BrainLayer to remember voice conversations across sessions.
License