VoiceLayer

VoiceLayer MCP server enables AI coding assistants to speak and hear via local, on-device speech-to-text and text-to-speech, with no cloud dependencies.

Your AI agent can't hear you. VoiceLayer gives it ears and a voice.

Voice I/O for AI coding assistants. Press F5, speak to Claude Code, get on-device transcription in under 1.5 seconds. Your AI speaks back. Works with any MCP client.

  You ──🎤──> whisper.cpp ──> Claude Code ──> edge-tts ──🔊──> You
         STT (local)           MCP tools         TTS (free)

Local-first. Free. Open-source. With the default whisper.cpp backend, transcription runs on-device and needs no API key; Wispr Flow is an optional cloud fallback. Part of the Golems ecosystem.

VoiceLayer runs as a persistent singleton daemon on a Unix socket: every Claude session connects through a lightweight socat shim instead of spawning its own process. Two canonical MCP tools plus nine backward-compatible aliases ship with full ToolAnnotations.

Architecture

                  ┌────────────────────────────────────────┐
                  │           VoiceLayer Daemon            │
                  │        /tmp/voicelayer-mcp.sock        │
                  │                                        │
                  │  MCP JSONRPC ──> Tool Handlers         │
                  │  (Content-Length     ├── voice_speak   │
                  │   framing)           └── voice_ask     │
                  │                                        │
                  │  TTS: edge-tts (retry + 30s timeout)   │
                  │  STT: whisper.cpp / Wispr Flow         │
                  │  VAD: Silero ONNX (speech detection)   │
                  │  IPC: Voice Bar ← NDJSON events        │
                  └──────────┬─────────────────────────────┘
                             │ Unix socket
              ┌──────────────┼──────────────┐
              │              │              │
         Claude Code    Claude Code    Cursor/Codex
         (socat shim)   (socat shim)   (socat shim)
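
The diagram mentions Content-Length framing for the MCP JSON-RPC stream. As a minimal sketch, assuming the LSP-style header layout (`Content-Length: N`, a blank line, then N bytes of JSON), framing and unframing could look like the code below. The function names are hypothetical and not taken from `mcp-framing.ts`.

```typescript
// Sketch only: assumes LSP-style "Content-Length: N\r\n\r\n<body>" frames.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

/** Wrap one JSON-RPC message in a Content-Length frame (length is in bytes, not characters). */
function encodeFrame(message: unknown): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const frame = new Uint8Array(header.length + body.length);
  frame.set(header, 0);
  frame.set(body, header.length);
  return frame;
}

/** Find the start of the "\r\n\r\n" header terminator, or -1 if it has not arrived yet. */
function indexOfDoubleCrlf(buf: Uint8Array, from: number): number {
  for (let i = from; i + 3 < buf.length; i++) {
    if (buf[i] === 13 && buf[i + 1] === 10 && buf[i + 2] === 13 && buf[i + 3] === 10) return i;
  }
  return -1;
}

/** Parse every complete frame in the buffer; return the messages plus any leftover bytes. */
function decodeFrames(buffer: Uint8Array): { messages: unknown[]; rest: Uint8Array } {
  const messages: unknown[] = [];
  let offset = 0;
  while (true) {
    const headerEnd = indexOfDoubleCrlf(buffer, offset);
    if (headerEnd === -1) break; // header incomplete, wait for more data
    const header = decoder.decode(buffer.subarray(offset, headerEnd));
    const match = /Content-Length:\s*(\d+)/i.exec(header);
    if (!match) throw new Error("missing Content-Length header");
    const bodyStart = headerEnd + 4;
    const bodyEnd = bodyStart + Number(match[1]);
    if (bodyEnd > buffer.length) break; // body incomplete, wait for more data
    messages.push(JSON.parse(decoder.decode(buffer.subarray(bodyStart, bodyEnd))));
    offset = bodyEnd;
  }
  return { messages, rest: buffer.subarray(offset) };
}
```

A socket reader would append each incoming chunk to `rest` and call `decodeFrames` again, which is why partial frames are returned rather than treated as errors.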

Why a daemon? The original design spawned a new Bun process per Claude session. With 17+ repos open, that meant 17 or more competing processes (roughly 700 MB of RAM) fighting over one Voice Bar socket, crashing edge-tts with PATH issues, and leaving orphans that never died. The daemon architecture, shipped in PRs #67-72, replaced all of that with a single process and socat shims.

Metric	Before (spawn-per-session)	After (daemon)
Processes	One per session (17+ typical)	1 daemon + socat shims
RAM	~700 MB (17 × 41 MB)	~50 MB
Orphan cleanup	Manual `pkill`	PID lockfile auto-kills stale
edge-tts failures	Random (PATH, contention)	Retry with 30s hard timeout
voice_ask hang	Up to 300s (5 min!)	30s default + outer guard

Quick Start

# Install from npm
bun add -g voicelayer-mcp

# Prerequisites
brew install sox socat
pip3 install edge-tts
brew install whisper-cpp  # optional — local STT

# Download a whisper model (recommended)
mkdir -p ~/.cache/whisper
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

Or install from source:

git clone https://github.com/EtanHey/voicelayer.git
cd voicelayer && bun install

Start the Daemon

# Option A: LaunchAgent (auto-start on login, auto-restart on crash)
./launchd/install.sh

# Option B: Manual
bun run src/mcp-server-daemon.ts

Disabling VoiceLayer

DISABLE_VOICELAYER=1 is a hard kill-switch for the MCP daemon.

# Install the LaunchAgent in a disabled state and sync the runtime daemon flag
DISABLE_VOICELAYER=1 ./launchd/install.sh

# Or edit the template-generated plist and add, inside its EnvironmentVariables dict:
# <key>DISABLE_VOICELAYER</key>
# <string>1</string>

If the daemon is already running, create /tmp/.voicelayer-daemon-disabled and it will shut down within 5 seconds. ./launchd/install.sh also keeps that file in sync with DISABLE_VOICELAYER, so VoiceBar-launched daemons stay disabled too. To re-enable it, remove the env var from ~/Library/LaunchAgents/com.voicelayer.mcp-daemon.plist, delete /tmp/.voicelayer-daemon-disabled if present, and restart the agent:

launchctl kickstart -k "gui/$(id -u)/com.voicelayer.mcp-daemon"
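
The kill-switch has two inputs: the `DISABLE_VOICELAYER` env var and the `/tmp/.voicelayer-daemon-disabled` flag file. A rough sketch of how a daemon might combine them is shown below. The helper names are hypothetical, and the 5-second polling interval is an assumption inferred from the "within 5 seconds" shutdown window, not the daemon's actual code.

```typescript
import { existsSync } from "node:fs";

const DISABLE_FLAG_FILE = "/tmp/.voicelayer-daemon-disabled";

type Env = Record<string, string | undefined>;

/** True when either the env var or the flag file asks the daemon to stay off. */
function isDisabled(env: Env, flagFileExists: boolean = existsSync(DISABLE_FLAG_FILE)): boolean {
  return env.DISABLE_VOICELAYER === "1" || flagFileExists;
}

/**
 * Poll the kill-switch and call onDisabled once it trips.
 * Returns a function that stops the polling.
 */
function watchKillSwitch(
  onDisabled: () => void,
  check: () => boolean,
  intervalMs = 5000, // assumed interval
): () => void {
  const timer = setInterval(() => {
    if (check()) {
      clearInterval(timer);
      onDisabled();
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Passing the flag-file state as a parameter keeps `isDisabled` easy to test without touching `/tmp`.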

Configure MCP Clients

Add to your .mcp.json (in any repo where you use Claude Code):

{
  "mcpServers": {
    "voicelayer": {
      "command": "socat",
      "args": ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"]
    }
  }
}

Or migrate all repos at once:

bash scripts/migrate-to-daemon.sh         # migrates every .mcp.json under ~/Gits
bash scripts/migrate-to-daemon.sh --dry-run  # preview without changes
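
For a sense of what a per-file migration involves, here is a small sketch that rewrites the `voicelayer` entry of one parsed `.mcp.json` to the socat shim. It is an illustration only: `migrateConfig` is a hypothetical name, and the real `scripts/migrate-to-daemon.sh` may behave differently.

```typescript
type McpServer = { command: string; args?: string[]; env?: Record<string, string> };
type McpConfig = { mcpServers?: Record<string, McpServer> };

const SOCAT_SHIM: McpServer = {
  command: "socat",
  args: ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"],
};

/** Return a migrated copy of the config and whether anything changed. */
function migrateConfig(config: McpConfig): { config: McpConfig; changed: boolean } {
  const current = config.mcpServers?.voicelayer;
  if (!current) return { config, changed: false }; // this repo does not use VoiceLayer
  const alreadyShim =
    current.command === SOCAT_SHIM.command &&
    JSON.stringify(current.args) === JSON.stringify(SOCAT_SHIM.args);
  if (alreadyShim) return { config, changed: false };
  return {
    config: {
      ...config,
      mcpServers: { ...config.mcpServers, voicelayer: { ...SOCAT_SHIM } },
    },
    changed: true,
  };
}
```

Returning a `changed` flag is what makes a dry-run mode cheap: the caller can report which files would be rewritten without writing anything.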

Grant microphone access to your terminal (macOS: System Settings > Privacy & Security > Microphone).

Voice Tools

Primary tools

Tool	Behavior	Blocking	readOnly	destructive	idempotent
`voice_speak`	TTS with auto-mode (announce/brief/consult/think), replay, toggle	No	false	false	true
`voice_ask`	Speak question + record mic + transcribe response	Yes	false	false	false
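
As a sketch, the table's columns map onto the hint fields of MCP ToolAnnotations as follows. The values are copied from the table, with `openWorldHint: false` as stated for all VoiceLayer tools. The `safeToAutoRetry` helper is a hypothetical client-side policy added for illustration, not something VoiceLayer ships.

```typescript
// Hint field names follow the MCP ToolAnnotations schema.
interface ToolAnnotations {
  readOnlyHint: boolean;
  destructiveHint: boolean;
  idempotentHint: boolean;
  openWorldHint: boolean;
}

const PRIMARY_TOOLS: Record<"voice_speak" | "voice_ask", ToolAnnotations> = {
  voice_speak: { readOnlyHint: false, destructiveHint: false, idempotentHint: true, openWorldHint: false },
  voice_ask: { readOnlyHint: false, destructiveHint: false, idempotentHint: false, openWorldHint: false },
};

/** Hypothetical client policy: only repeat a failed call automatically if it is idempotent and non-destructive. */
function safeToAutoRetry(tool: keyof typeof PRIMARY_TOOLS): boolean {
  const hints = PRIMARY_TOOLS[tool];
  return hints.idempotentHint && !hints.destructiveHint;
}
```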

Backward-compatible aliases

Alias	Maps to	idempotent
`qa_voice_announce`	`voice_speak(mode='announce')`	true
`qa_voice_brief`	`voice_speak(mode='brief')`	true
`qa_voice_consult`	`voice_speak(mode='consult')`	true
`qa_voice_say`	`voice_speak(mode='announce')`	true
`qa_voice_think`	`voice_speak(mode='think')`	false
`qa_voice_replay`	`voice_speak(replay_index=N)`	true
`qa_voice_toggle`	`voice_speak(enabled=bool)`	true
`qa_voice_converse`	`voice_ask`	false
`qa_voice_ask`	`voice_ask`	false

All 11 tools include MCP ToolAnnotations. No VoiceLayer tools are destructive. All have openWorldHint: false.
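
The alias table can be read as a lookup from a legacy tool name to a canonical tool plus fixed arguments. A minimal sketch of that resolution step follows. `resolveTool` and the `message` argument in the usage note are hypothetical, and the assumption is that callers of `qa_voice_replay` and `qa_voice_toggle` already pass `replay_index` and `enabled` themselves.

```typescript
type CanonicalTool = "voice_speak" | "voice_ask";
type ToolArgs = Record<string, unknown>;

// Each alias maps to a canonical tool plus arguments that the alias pins.
const ALIASES: Record<string, { tool: CanonicalTool; fixed: ToolArgs }> = {
  qa_voice_announce: { tool: "voice_speak", fixed: { mode: "announce" } },
  qa_voice_brief: { tool: "voice_speak", fixed: { mode: "brief" } },
  qa_voice_consult: { tool: "voice_speak", fixed: { mode: "consult" } },
  qa_voice_say: { tool: "voice_speak", fixed: { mode: "announce" } },
  qa_voice_think: { tool: "voice_speak", fixed: { mode: "think" } },
  qa_voice_replay: { tool: "voice_speak", fixed: {} }, // caller supplies replay_index
  qa_voice_toggle: { tool: "voice_speak", fixed: {} }, // caller supplies enabled
  qa_voice_converse: { tool: "voice_ask", fixed: {} },
  qa_voice_ask: { tool: "voice_ask", fixed: {} },
};

/** Turn any of the 11 tool names into a canonical call. */
function resolveTool(name: string, args: ToolArgs = {}): { tool: CanonicalTool; args: ToolArgs } {
  if (name === "voice_speak" || name === "voice_ask") return { tool: name, args };
  const alias = ALIASES[name];
  if (!alias) throw new Error(`unknown tool: ${name}`);
  // Fixed arguments win over caller arguments so an alias cannot change its own mode.
  return { tool: alias.tool, args: { ...args, ...alias.fixed } };
}
```

For example, `resolveTool("qa_voice_brief", { message: "Build passed" })` yields a `voice_speak` call with `mode` set to `brief` and the caller's `message` preserved.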

How voice_ask Works

1. Waits for any playing voice_speak audio to finish
2. Speaks the question via edge-tts (with retry on failure)
3. Records mic at device native rate, resamples to 16 kHz
4. Silero VAD detects speech onset and silence end
5. whisper.cpp transcribes locally (~200-400 ms on Apple Silicon)
6. Returns transcription to the AI agent
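
The resampling step converts audio from the device's native rate to 16 kHz. The sketch below uses plain linear interpolation to show the idea. This is a swapped-in technique chosen for brevity, not necessarily what VoiceLayer does: a production resampler would low-pass filter first to avoid aliasing. `resampleLinear` is a hypothetical name.

```typescript
/** Resample mono PCM samples from fromRate to toRate by linear interpolation. */
function resampleLinear(input: Float32Array, fromRate: number, toRate = 16000): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples consumed per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two neighbouring input samples by their distance to pos.
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

One second of 48 kHz audio (48,000 samples) comes out as 16,000 samples, which is the rate the VAD and transcription steps consume.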

Reliability Features

PID lockfile (/tmp/voicelayer-mcp.pid): On startup, detects and kills any orphan MCP server from a previous session
edge-tts retry: Health check (cached 60s) + automatic retry with 30s hard timeout per attempt
Outer timeout guard: Promise.race wrapper around the entire voice_ask flow — if anything hangs, returns an error instead of blocking forever
Session booking: Lockfile mutex prevents mic conflicts between concurrent sessions
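
The retry and outer-guard behavior described above can be sketched with `Promise.race`. This is an illustration under assumptions: `withTimeout` and `retryWithTimeout` are hypothetical names, and the real handlers may differ. Note that `Promise.race` only stops waiting; it does not cancel the underlying work.

```typescript
/** Reject if work has not settled within ms. The timer is always cleared afterwards. */
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

/** Run attempt up to `attempts` times, each with its own hard timeout. */
async function retryWithTimeout<T>(
  attempt: () => Promise<T>,
  attempts: number,
  perAttemptMs: number,
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let i = 0; i < attempts; i++) {
    try {
      return await withTimeout(attempt(), perAttemptMs, `attempt ${i + 1}`);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

In this shape, an edge-tts call would go through `retryWithTimeout` with a 30,000 ms per-attempt limit, and the whole voice_ask flow would sit inside one more `withTimeout` as the outer guard.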

Recording Controls

Method	How
Stop signal	`touch ~/.local/state/voicelayer/stop-{token}`
VAD silence	Configurable: quick (0.5s), standard (1.5s), thoughtful (2.5s)
Timeout	30s default, configurable 5-3600s per call
Push-to-talk	`press_to_talk: true` — no VAD, stop on signal only

STT Backends

Backend	Type	Latency	Setup
whisper.cpp	Local (default)	~200-400ms	`brew install whisper-cpp` + model download
Wispr Flow	Cloud (fallback)	~500ms + network	Set `QA_VOICE_WISPR_KEY` env var

Auto-detected. Override with QA_VOICE_STT_BACKEND=whisper|wispr|auto.
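
A sketch of how the override and auto-detection might fit together is shown below. The `auto` rule here (prefer local whisper.cpp, fall back to Wispr Flow only when a key is set) is an assumption based on the "default" and "fallback" labels in the table, and `chooseSttBackend` is a hypothetical name.

```typescript
type SttBackend = "whisper" | "wispr";
type Env = Record<string, string | undefined>;

/** Pick an STT backend from the env override and what is actually available. */
function chooseSttBackend(env: Env, whisperAvailable: boolean): SttBackend {
  const requested = env.QA_VOICE_STT_BACKEND ?? "auto";
  const hasWisprKey = Boolean(env.QA_VOICE_WISPR_KEY);
  switch (requested) {
    case "whisper":
      if (!whisperAvailable) throw new Error("whisper requested but whisper.cpp or its model is missing");
      return "whisper";
    case "wispr":
      if (!hasWisprKey) throw new Error("wispr requested but QA_VOICE_WISPR_KEY is not set");
      return "wispr";
    case "auto":
      if (whisperAvailable) return "whisper"; // local first
      if (hasWisprKey) return "wispr"; // cloud fallback
      throw new Error("no STT backend available");
    default:
      throw new Error(`invalid QA_VOICE_STT_BACKEND: ${requested}`);
  }
}
```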

Voice Bar (macOS)

Floating SwiftUI widget providing visual feedback during voice interactions. Connects to the daemon via NDJSON over /tmp/voicelayer.sock.

Teleprompter with word-level highlighting and auto-scroll
Waveform visualization during recording
Expandable pill UI — collapses to dot after 5s idle
Draggable, position persisted across launches
Global hotkey: F5 (hold for push-to-talk)

bun add -g voicelayer-mcp
voicelayer hotkey install       # Install F5/Dictation -> F18 relay
voicelayer bar                  # Build and launch Voice Bar

Hotkey Notes:

Requires Input Monitoring permission (System Settings > Privacy & Security)
On keyboards where the physical key is Apple's Dictation key, voicelayer hotkey install installs a hidutil LaunchAgent to map F5/Dictation to VoiceBar's internal F18 relay.
The installer preserves non-VoiceBar hidutil mappings and is safe to rerun. Shift+F5 re-pastes the latest transcript.

Advanced: Voice Cloning

Three-tier TTS engine cascade for cloned voices, with edge-tts as the final fallback:

1. XTTS-v2 fine-tuned (cadence + timbre)
2. F5-TTS MLX zero-shot (local, no daemon)
3. Qwen3-TTS daemon (HTTP-based)
4. edge-tts fallback (always available)
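
A cascade like this is typically a loop that tries each engine in priority order and moves on when one fails. The sketch below shows that shape with stubbed engines. The `TtsEngine` interface and `synthesizeWithCascade` are hypothetical names, not VoiceLayer's actual API.

```typescript
interface TtsEngine {
  name: string;
  synthesize: (text: string) => Promise<Uint8Array>;
}

/** Try each engine in order and return the first successful result. */
async function synthesizeWithCascade(
  engines: TtsEngine[],
  text: string,
): Promise<{ engine: string; audio: Uint8Array }> {
  const failures: string[] = [];
  for (const engine of engines) {
    try {
      return { engine: engine.name, audio: await engine.synthesize(text) };
    } catch (err) {
      // Record why this tier failed, then fall through to the next one.
      failures.push(`${engine.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all TTS engines failed (${failures.join("; ")})`);
}
```

Keeping the failure reasons makes it possible to log why a cloned voice silently fell back to edge-tts.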

voicelayer extract <youtube-url>   # Extract voice samples
voicelayer clone <name>            # Build voice profile
voicelayer daemon --port 8880      # Run Qwen3-TTS server

The Qwen3 daemon now uses bearer auth from ~/.voicelayer/daemon.secret (created on first launch with mode 0600). The TypeScript bridge reads the same file automatically. Override the location with VOICELAYER_TTS_DAEMON_SECRET_FILE, VOICELAYER_TTS_AUTH_TOKEN_FILE, or voicelayer daemon --daemon-secret-file ... if you need a custom launcher path. The daemon only accepts Host: 127.0.0.1:8880 / Host: localhost:8880, rejects non-local Origin headers, and only reads reference_wav files that resolve under ~/.voicelayer/voices/.
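
The three request checks (Host allow-list, local Origin, and `reference_wav` containment) can be sketched as small predicates. This is an illustration, not the daemon's code: the function names are hypothetical, requests without an Origin header are assumed to be allowed, and the path check only normalizes the path. A real implementation would also need to resolve symlinks, for example with `fs.realpathSync`.

```typescript
import { homedir } from "node:os";
import { resolve, sep } from "node:path";

const ALLOWED_HOSTS = new Set(["127.0.0.1:8880", "localhost:8880"]);

/** Accept only the two Host values the daemon is documented to serve. */
function isAllowedHost(hostHeader: string | undefined): boolean {
  return hostHeader !== undefined && ALLOWED_HOSTS.has(hostHeader.toLowerCase());
}

/** Accept a missing Origin (assumption) or one whose hostname is local. */
function isLocalOrigin(origin: string | undefined): boolean {
  if (origin === undefined) return true;
  try {
    const hostname = new URL(origin).hostname;
    return hostname === "127.0.0.1" || hostname === "localhost";
  } catch {
    return false; // malformed Origin values are rejected
  }
}

/** True when candidate, resolved against voicesDir, stays inside voicesDir. */
function isUnderVoicesDir(
  candidate: string,
  voicesDir: string = resolve(homedir(), ".voicelayer", "voices"),
): boolean {
  const resolved = resolve(voicesDir, candidate);
  // The trailing separator stops sibling directories like "voices-evil" from matching.
  return resolved.startsWith(voicesDir + sep);
}
```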

Environment Variables

Variable	Default	Description
`QA_VOICE_STT_BACKEND`	`auto`	STT backend: `whisper`, `wispr`, or `auto`
`QA_VOICE_WHISPER_MODEL`	auto-detected	Path to whisper.cpp GGML model
`QA_VOICE_WISPR_KEY`	--	Wispr Flow API key (cloud fallback)
`QA_VOICE_TTS_VOICE`	`en-US-JennyNeural`	edge-tts voice ID
`QA_VOICE_TTS_RATE`	`+0%`	Base speech rate
`VOICELAYER_TTS_DAEMON_SECRET_FILE`	`~/.voicelayer/daemon.secret`	Preferred override for the shared Qwen3 daemon bearer secret file
`VOICELAYER_TTS_AUTH_TOKEN_FILE`	`~/.voicelayer/daemon.secret`	Backward-compatible override for the shared Qwen3 daemon bearer secret file
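
As a sketch of how the TTS-related variables and their defaults could be read, see below. `loadTtsConfig` is a hypothetical helper. The precedence between the two secret-file variables (preferred first, backward-compatible second) is an assumption drawn from their descriptions in the table, and the `~` in the default path is left unexpanded.

```typescript
type Env = Record<string, string | undefined>;

interface TtsConfig {
  voice: string;
  rate: string;
  daemonSecretFile: string;
}

/** Read TTS settings from the environment, applying the documented defaults. */
function loadTtsConfig(env: Env): TtsConfig {
  return {
    voice: env.QA_VOICE_TTS_VOICE ?? "en-US-JennyNeural",
    rate: env.QA_VOICE_TTS_RATE ?? "+0%",
    // Assumed precedence: preferred variable, then the backward-compatible one, then the default.
    daemonSecretFile:
      env.VOICELAYER_TTS_DAEMON_SECRET_FILE ??
      env.VOICELAYER_TTS_AUTH_TOKEN_FILE ??
      "~/.voicelayer/daemon.secret",
  };
}
```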

Testing

bun test                              # 585 Bun tests + 1 skip (latest verified on PR #190 pre-push gate)
bash flow-bar/run_tests.sh            # 144 Swift tests for VoiceBar
git config core.hooksPath .githooks   # install repo pre-push hook once per clone (#181, #182)

Test coverage includes: MCP protocol framing, tool handlers, TTS synthesis + retry, VAD speech detection, session booking, process lock lifecycle, socket client reconnection, edge-tts health checks, schema validation, Hebrew STT eval baselines, daemon resilience, ToolAnnotations, SSML sanitization, and secure path hardening.

Recent Hardening (2026-04-27 → 2026-05-02)

One-week sprint focused on VoiceBar reliability and a recording corpus to fight STT regressions. Every line below traces to a merged PR.

Recording reliability

Recording control clickability restored: F6 socket controls stay interactive while the pill animates (#188).
Pill bottom anchor preserved during resize so the UI doesn't drift off-screen (#187).
Waveform animates again on real audio input + redundant "listening" copy removed (#184).
Waveform dynamic range restored above the silence gate (#185).
Custom VoiceBar install paths supported (no more hard-coded /Applications/VoiceBar.app) (#186).
VoiceBar transcription preserved through the recording RMS gate so quiet speech survives (#177).
Stale daemon restart detection — VoiceBar transcription resumes automatically after the daemon restarts (#183).

STT quality

No-input STT hallucinations suppressed (#189).
Zero-RMS audio ingestion watchdog catches a silent mic before whisper.cpp guesses (#178).
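
A zero-RMS check is simple to express: compute the root-mean-square level of the captured samples and treat a level at or below a threshold as a dead microphone. The sketch below is illustrative only. The function names are hypothetical, and the default threshold of exactly zero is an assumption based on the "Zero-RMS" wording.

```typescript
/** Root-mean-square level of a block of PCM samples in the range [-1, 1]. */
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

/** True when the capture carries no signal and should not be sent to the transcriber. */
function isSilentCapture(samples: Float32Array, threshold = 0): boolean {
  return rms(samples) <= threshold;
}
```

Running this check before transcription means an all-zero buffer is reported as a microphone problem instead of being handed to whisper.cpp.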

VoiceBar dictation corpus (Phase 1) — #190

Every successful VoiceBar dictation is archived under ~/.local/share/voicelayer/recordings/YYYY-MM-DD/<timestamp-id>/ with audio.wav + voicelayer-transcript.txt + metadata.json (schema v1, SHA-256 over WAV bytes).
Atomic rename + fsync so partial writes never appear in the corpus.
Cancelled or empty transcriptions are skipped — only real dictations land on disk.
Re-paste hotkey moved to Shift+F5; plain F5 is now the default record-start/stop activation through VoiceBar's F18 relay.
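
The corpus layout and the write-then-rename pattern can be sketched as follows. This is an approximation under stated assumptions: the function names and the metadata fields other than the schema version and SHA-256 are hypothetical, the date folder uses UTC, and atomicity is applied per file here, whereas the real implementation may rename a whole directory.

```typescript
import { createHash } from "node:crypto";
import { closeSync, fsyncSync, mkdirSync, openSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { join } from "node:path";

function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

/** Write to a temporary name, fsync, then rename, so readers never see a partial file. */
function writeFileAtomic(path: string, data: Uint8Array | string): void {
  const tmp = `${path}.tmp`;
  const fd = openSync(tmp, "w");
  try {
    writeFileSync(fd, data);
    fsyncSync(fd); // flush file contents to disk before the rename makes them visible
  } finally {
    closeSync(fd);
  }
  renameSync(tmp, path);
}

/** Archive one dictation. Returns the recording directory, or null if it was skipped. */
function archiveDictation(
  root: string,
  id: string,
  wav: Uint8Array,
  transcript: string,
  now: Date = new Date(),
): string | null {
  if (transcript.trim() === "") return null; // cancelled or empty dictations are skipped
  const day = now.toISOString().slice(0, 10); // YYYY-MM-DD in UTC (assumption)
  const dir = join(root, day, id);
  mkdirSync(dir, { recursive: true });
  writeFileAtomic(join(dir, "audio.wav"), wav);
  writeFileAtomic(join(dir, "voicelayer-transcript.txt"), transcript);
  // Only the schema version and the SHA-256 come from the README; createdAt is hypothetical.
  const metadata = { schema: 1, sha256: sha256Hex(wav), createdAt: now.toISOString() };
  writeFileAtomic(join(dir, "metadata.json"), JSON.stringify(metadata, null, 2));
  return dir;
}

/** Integrity check: does audio.wav still match the hash stored in metadata.json? */
function verifyRecording(dir: string): boolean {
  const metadata = JSON.parse(readFileSync(join(dir, "metadata.json"), "utf8")) as { sha256: string };
  return sha256Hex(readFileSync(join(dir, "audio.wav"))) === metadata.sha256;
}
```

Storing the hash next to the audio lets a regression run confirm that a fixture has not been altered before comparing transcripts against it.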

Test infrastructure

VoiceLayer pre-push regression gate (#181) plus exit-0 fix on the success path (#182).
voicelayer run_tests.sh orchestrator script unifies Bun + Swift + daemon-boot + Karabiner smoke runs (#180).
VoiceBar audio fixtures for golden-path STT regressions (#179).

Project Structure

voicelayer/
├── src/                          # TypeScript/Bun (18K lines, 69 files)
│   ├── mcp-server-daemon.ts      # Singleton daemon entry point
│   ├── mcp-server.ts             # Stdio MCP server (legacy)
│   ├── mcp-daemon.ts             # Unix socket server (dual-protocol)
│   ├── mcp-framing.ts            # Content-Length + NDJSON framing
│   ├── mcp-handler.ts            # JSONRPC request router
│   ├── process-lock.ts           # PID lockfile (orphan prevention)
│   ├── handlers.ts               # Tool handler implementations
│   ├── tts.ts                    # Multi-engine TTS with playback queue
│   ├── tts-health.ts             # edge-tts health check + retry
│   ├── input.ts                  # Mic recording + STT pipeline
│   ├── vad.ts                    # Silero VAD (ONNX inference)
│   ├── stt.ts                    # STT backend abstraction
│   ├── socket-client.ts          # Voice Bar IPC (auto-reconnect)
│   ├── session-booking.ts        # Lockfile mutex
│   ├── paths.ts                  # Centralized path constants
│   └── __tests__/                # 536 tests across 48 files
├── flow-bar/                     # SwiftUI macOS app (1.9K lines, 9 files)
│   ├── Sources/VoiceBar/         # App source
│   └── Tests/                    # Swift tests
├── scripts/
│   ├── migrate-to-daemon.sh      # Batch .mcp.json migration
│   └── edge-tts-words.py         # Word-level TTS with timestamps
├── launchd/                      # macOS LaunchAgent auto-start
├── models/                       # Silero VAD ONNX model
└── package.json                  # v2.0.0

Platform Support

Platform	TTS	STT	Recording	Voice Bar
macOS	edge-tts + afplay	whisper.cpp (CoreML)	sox	SwiftUI app
Linux	edge-tts + mpv/ffplay	whisper.cpp	sox	--

Part of Golems

VoiceLayer is one of three open-source MCP servers in the Golems ecosystem:

Server	What it does	Tools
BrainLayer	Persistent memory for AI agents — knowledge graph + hybrid search	12
VoiceLayer	Voice I/O — local STT, neural TTS, F5 push-to-talk	11
cmuxLayer	Terminal orchestration — spawn panes, read screens, coordinate agents	22

Pair with BrainLayer to remember voice conversations across sessions.

License

Apache-2.0
