Augent — The Audio Layer for Agents
<p align="center"> <picture> <img src="./images/logo.png" width="600" alt="Augent"> </picture> </p>
<p align="center"> <strong>The wormhole stays open. Fully local. Fully private.</strong> </p>
<p align="center"> <a href="https://github.com/AugentDevs/Augent/actions/workflows/tests.yml"><img src="https://img.shields.io/github/actions/workflow/status/AugentDevs/Augent/tests.yml?label=build&style=for-the-badge" alt="Build"></a> <img src="https://img.shields.io/badge/dynamic/toml?url=https://raw.githubusercontent.com/AugentDevs/Augent/main/pyproject.toml&query=$.project.version&label=version&style=for-the-badge" alt="Version"> <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10+-3776AB.svg?style=for-the-badge" alt="Python 3.10+"></a> <a href="https://discord.com/invite/DNmaZtaE7b"><img src="https://img.shields.io/badge/Discord-Join-5865F2.svg?style=for-the-badge&logo=discord&logoColor=white" alt="Discord"></a> <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-blue.svg?style=for-the-badge" alt="License: MIT"></a> </p>
<p align="center"> <a href="#mcp-tools">MCP Tools</a> · <a href="#cli">CLI</a> · <a href="#claude-code-skill">Claude Code Skill</a> · <a href="#openclaw">OpenClaw</a> · <a href="#web-ui">Web UI</a> · <a href="https://augent.app">Website</a> · <a href="https://docs.augent.app">Docs</a> · <a href="CHANGELOG.md">Changelog</a> · <a href="mailto:hello@augent.app">Contact</a> </p>
If the answer is trapped in audio or video, this is the way through.
Augent turns any audio or video source into structured, searchable intelligence for agents. Give it URLs or files. It downloads, transcribes, indexes, and stores everything in persistent memory. Search by keyword or meaning, find where concepts intersect, identify speakers, generate chapters and notes, batch process entire libraries, and more. One install, full pipeline, entirely on your machine.
If you want the substance of a recording without sitting through it, this is the fastest way.
Preferred setup: run the one-line installer in your terminal. One command installs Augent, all dependencies, and the MCP server config. Works on macOS and Linux. Windows: install via pip. Works with Claude Code, Codex, and any MCP client. New install? Start here: Getting started.
<br />
Install
curl -fsSL https://augent.app/install.sh | bash
Works on macOS and Linux. Installs everything automatically.
Windows: pip install "augent[all] @ git+https://github.com/AugentDevs/Augent.git"
<details> <summary><strong>What does the installer do?</strong></summary>
<br />
The installer is a single bash script (source). Every dependency is open source:
| Dependency | What it does |
|---|---|
| Python | Runtime |
| FFmpeg | Audio processing |
| yt-dlp | Media downloads |
| aria2 | Parallel downloads |
| espeak-ng | TTS phonemizer |
| faster-whisper | Speech-to-text |
| PyTorch | ML framework |
| sentence-transformers | Semantic search |
| pyannote-audio | Speaker diarization |
| Kokoro | Text-to-speech |
| Demucs | Audio source separation |
| FastAPI | Local web UI |
No background services. No telemetry. No sudo on macOS.
| Guide | What it covers |
|---|---|
| Full breakdown | What each phase installs and why |
| Manual install | Step-by-step for macOS, Linux, and Windows |
| Uninstall | How to fully remove Augent |
</details>
<br />
<p align="center"> <picture> <img src="./images/install-demo.svg" alt="Install demo"> </picture> </p>
<br />
How it works (short)
graph TB
A["URL / File"] --> B["Download + Separate"]
B --> C["Transcribe"]
C --> D["Memory + Tag"]
D --> E["Search"]
D --> F["Analyze"]
D --> G["Export"]
style A fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style B fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style C fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style D fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style E fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
style F fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
style G fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
linkStyle default stroke:#00f060,stroke-width:1.5px
Project Structure
augent/
├── mcp.py # MCP server — 22 tools for agents
├── config.py # User configuration (~/.augent/config.yaml)
├── core.py # Transcription engine (faster-whisper)
├── search.py # Keyword search
├── embeddings.py # Semantic search, chapters, visual scoring
├── speakers.py # Speaker diarization (pyannote-audio)
├── separator.py # Audio source separation (Demucs v4)
├── tts.py # Text-to-speech (Kokoro)
├── memory.py # Three-layer memory (SQLite)
├── graph.py # Obsidian graph view (wikilinks, MOCs, frontmatter)
├── clips.py # CLI clip extraction (audio segments around matches)
├── export.py # Export formats (JSON, CSV, SRT, VTT, MD)
├── cli.py # CLI interface
└── web.py # Web UI (FastAPI)
<br />
MCP Tools
The primary way to use Augent. Any MCP client gets direct access to all tools.
Add to ~/.claude.json (global) or .mcp.json (project):
{
"mcpServers": {
"augent": {
"command": "augent-mcp"
}
}
}
Restart Claude Code. Run /mcp to verify connection.
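If the config file already lists other servers, the entry has to be merged in rather than pasted over them. A minimal sketch of that merge follows; `with_augent` and `update_config_file` are hypothetical helper names for illustration, not part of Augent.

```python
import json
from pathlib import Path


def with_augent(config: dict) -> dict:
    """Return a copy of an MCP config that includes the augent server."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same entry as the JSON snippet above.
    servers["augent"] = {"command": "augent-mcp"}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Apply with_augent() to a JSON config file, creating it if missing."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(with_augent(existing), indent=2) + "\n")
```

For a project-level config, `update_config_file(Path(".mcp.json"))` would be the call; back up the file first if it holds settings you care about.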
| Tool | Description |
|---|---|
| download_audio | Download audio from video URLs at maximum speed (1,000+ supported sites) |
| transcribe_audio | Full transcription with metadata |
| search_audio | Find keywords with timestamps and context snippets |
| deep_search | Search audio by meaning, not just keywords (semantic search) |
| take_notes | Take notes from any URL with style presets |
| chapters | Auto-detect topic chapters in audio with timestamps |
| batch_search | Search multiple files in parallel, built for batch workflows and agent swarms |
| text_to_speech | Convert text to natural speech audio (Kokoro TTS, 54 voices, 9 languages) |
| search_proximity | Find where keywords appear near each other |
| identify_speakers | Identify who speaks when in audio (speaker diarization) |
| separate_audio | Isolate vocals from music and background noise (Demucs v4) |
| clip_export | Export a video clip from a URL for a specific time range |
| highlights | Export MP4 clips of specific moments, auto-pick the best or target exactly what you want |
| tag | Add, remove, or list tags on transcriptions for organized filtering |
| visual | Extract visual context from video at moments that matter (query, auto, or manual) |
| rebuild_graph | Rebuild Obsidian graph view data for all transcriptions |
| search_memory | Search across ALL stored transcriptions by keyword or meaning |
| list_files | List media files in a directory |
| list_memories | List stored transcriptions by title |
| memory_stats | View transcription memory statistics |
| clear_memory | Clear stored transcriptions |
| spaces | Download, check, or stop X/Twitter Spaces recordings |
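
To make "keywords with timestamps and context snippets" concrete, here is a rough sketch of the idea behind a keyword search over a word-level transcript. It assumes the transcript is a list of (start time, word) pairs; the `Word` type and `find_keyword` function are illustrative names, not Augent's actual implementation or output schema.

```python
from typing import NamedTuple


class Word(NamedTuple):
    start: float  # seconds from the start of the audio
    text: str


def find_keyword(words: list, keyword: str, context_words: int = 25) -> list:
    """Return each match with its timestamp and a snippet of surrounding words."""
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        # Compare case-insensitively and ignore surrounding punctuation.
        if word.text.lower().strip(".,!?;:\"'") == target:
            lo = max(0, i - context_words)
            hi = min(len(words), i + context_words + 1)
            snippet = " ".join(w.text for w in words[lo:hi])
            hits.append({"timestamp": word.start, "snippet": snippet})
    return hits
```

The `context_words` parameter here mirrors the setting of the same name in the configuration file; how the real tool applies it is not shown in this README.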
<details> <summary>Example prompt</summary>
"Download these 10 podcasts and find every moment a host covers a product in a positive or unique way. Not just brand mentions, only real endorsements or life-changing recommendations. Give me the timestamps and exactly what they said: url1, url2, url3, url4, url5, url6, url7, url8, url9, url10"
<p align="center"> <picture> <img src="./images/pipeline.png" alt="Augent Pipeline — From URLs to insights in one prompt" width="100%"> </picture> </p>
</details>
<br />
CLI
Full CLI for terminal-based workflows. Works standalone or with any agent.
<picture> <img src="./images/cli-help.png" alt="Augent CLI"> </picture>
| Command | Description |
|---|---|
| audio-downloader "URL" | Download audio from video URL (speed-optimized) |
| augent search audio.mp3 "keyword" | Search for keywords |
| augent transcribe audio.mp3 | Full transcription |
| augent proximity audio.mp3 "A" "B" | Find keyword A near keyword B |
| augent memory search "query" | Search across all stored transcriptions |
| augent memory stats | View memory statistics |
| augent memory list | List stored transcriptions |
| augent memory clear | Clear memory |
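
As an illustration of what "keyword A near keyword B" means, the sketch below pairs up occurrences of two words that fall within a fixed number of words of each other. The function name and the `max_distance` default are assumptions made for this example; the distance that `augent proximity` actually uses is not stated here.

```python
def find_proximity(words: list, a: str, b: str, max_distance: int = 30) -> list:
    """Return (index_of_a, index_of_b) pairs at most max_distance words apart."""
    lowered = [w.lower() for w in words]
    positions_a = [i for i, w in enumerate(lowered) if w == a.lower()]
    positions_b = [i for i, w in enumerate(lowered) if w == b.lower()]
    # Keep every pairing whose word-index gap is within the allowed distance.
    return [(i, j) for i in positions_a for j in positions_b if abs(i - j) <= max_distance]
```

With a timestamped transcript, the returned indices can be mapped back to the start times of the matching words.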
<br />
Eyes & Ears
Someone explains their entire workflow in a video. Augent transcribes it, builds the workflow files, maps the sequencing, the decision points, the tool stack. Every piece structured into something an agent can act on.
But some steps are inherently visual. Augent detects where visual context is needed and exports multiple screenshots at those moments, giving the agent frame-by-frame context of the flow being described. Audio intelligence plus visual context equals a complete, replicable system.
graph TB
A["Expert explains workflow or automation"] --> B["Augent transcribes + structures"]
B --> C["Builds workflow files + sequencing"]
C --> D["Maps decision logic"]
C --> E["Identifies tools + platforms"]
C --> F["Flags visual gaps"]
F --> G["Exports screenshots for context"]
D --> H["Ready to run"]
E --> H
G --> H
style A fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style B fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style C fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
style D fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
style E fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
style F fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
style G fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
style H fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
<br />
X/Twitter Spaces
Download or live-record X/Twitter Spaces audio. Augent auto-detects whether a Space is live or ended and handles both: live Spaces are recorded from the current moment using ffmpeg, and ended Spaces are downloaded in full via yt-dlp. All downloads run in the background so your agent keeps working.
One-time setup
X/Twitter requires authentication to access Space audio. Any account works, including a burner.
- Log into x.com in any browser
- Open DevTools (F12 or Cmd+Option+I) > Application > Cookies > https://x.com
- Copy the auth_token and ct0 values
- Create ~/.augent/auth.json:
{"auth_token": "PASTE_HERE", "ct0": "PASTE_HERE"}
Tokens are stored locally and only sent to Twitter's servers to fetch audio. Augent never posts, DMs, follows, or modifies anything on your account. To revoke access, log out of Twitter or delete ~/.augent/auth.json.
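Because auth.json holds live session cookies, it is reasonable to keep it readable only by your own user. The sketch below writes the file and tightens its permissions; the `write_auth_file` helper and the permission step are suggestions for illustration, not something the installer is documented to do.

```python
import json
import os
from pathlib import Path


def write_auth_file(auth_token: str, ct0: str, directory: Path) -> Path:
    """Write auth.json into `directory` and restrict it to the current user."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "auth.json"
    path.write_text(json.dumps({"auth_token": auth_token, "ct0": ct0}))
    # 0o600: owner read/write only (POSIX; has limited effect on Windows).
    os.chmod(path, 0o600)
    return path
```

Usage would look like `write_auth_file("PASTE_HERE", "PASTE_HERE", Path.home() / ".augent")`.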
<br />
Claude Code Skill
Claude and Codex already know how to use Augent's tools from their descriptions. The skill adds advanced workflows on top: multi-step note-taking pipelines, auto-tagging rules, translation flows, quiz formatting, and optimal search strategies.
mkdir -p ~/.claude/skills/augent
curl -o ~/.claude/skills/augent/SKILL.md \
https://raw.githubusercontent.com/AugentDevs/Augent/main/skills/augent/SKILL.md
Works globally across all projects. One install, every conversation benefits.
<br />
OpenClaw
Augent is available as an OpenClaw skill on ClawHub.
Install via ClawHub:
npx clawhub@latest install augent
Or set up manually:
augent setup openclaw
If you installed Augent with curl -fsSL https://augent.app/install.sh | bash, OpenClaw is detected and configured automatically. These commands are only needed for manual setup or pip installs.
<br />
Obsidian Graph View
Every transcription builds a node. Every shared tag builds a connection. Your audio memory becomes a navigable knowledge graph, entirely automatic.
<picture> <img src="./images/obsidian-graph.png" alt="Augent knowledge graph in Obsidian"> </picture>
Every take_notes call, every transcription, every tag creates structure: YAML frontmatter, [[wikilinks]] between related content, and MOC hub files that cluster topics. Run rebuild_graph once to upgrade existing memory. The graph grows on its own from there.
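For readers unfamiliar with how Obsidian builds a graph from plain files, the sketch below shows the general shape: YAML frontmatter at the top of a note, and `[[wikilinks]]` to other notes that share a tag. The function names and the exact note layout are assumptions for illustration; the files Augent writes may differ.

```python
def related_by_tag(notes: dict, title: str) -> list:
    """Titles of other notes sharing at least one tag with `title`."""
    own = notes[title]
    return sorted(t for t, tags in notes.items() if t != title and tags & own)


def build_note(title: str, tags: list, related: list, body: str) -> str:
    """Render one transcription as Markdown with frontmatter and wikilinks."""
    # Note: titles containing ':' would need quoting to stay valid YAML.
    lines = ["---", f"title: {title}", "tags:"]
    lines += [f"  - {tag}" for tag in tags]
    lines += ["---", "", body, ""]
    if related:
        # Each [[name]] becomes an edge in Obsidian's graph view.
        lines.append("Related: " + ", ".join(f"[[{name}]]" for name in related))
    return "\n".join(lines)
```

Obsidian draws an edge for every wikilink, so linking notes by shared tag is enough to make clusters appear in the graph view.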
Use Augent inside your main Obsidian vault, alongside your personal notes, journals, and projects. Everything compounds together. Full guide.
Using Claude Code or Codex with Obsidian? Set up augent-obsidian to make every .txt and .md file on your Mac open directly in Obsidian, with automatic sync for external edits.
<br />
Multilingual
Augent transcribes audio in its original language, powered by OpenAI's Whisper (via faster-whisper), which supports 99 languages including Chinese, French, Spanish, Japanese, Arabic, Hindi, Korean, German, Russian, Portuguese, and many more. Language is auto-detected; no configuration is needed. Translation to English is handled by Claude (or your LLM) rather than a local model, for better translation quality.
- When a transcription returns a non-English language, the MCP response includes a translation offer
- Accepting stores a clean English (eng) sibling file in memory alongside the original
- Both the original and translated versions appear in the Memory Explorer
<br />
Model Sizes
tiny is the default. Handles everything from clean studio recordings to noisy field audio. Use small or above for heavy accents, poor audio, or lyrics.
| Model | Speed | Accuracy |
|---|---|---|
| tiny | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |
<br />
Configuration
Customize defaults and disable tools you don't need via ~/.augent/config.yaml:
# ~/.augent/config.yaml
model_size: tiny # Default Whisper model
output_dir: ~/Downloads # Default download directory
notes_output_dir: ~/Desktop # Notes, clips, TTS output
clip_padding: 15 # Seconds of padding around clips
context_words: 25 # Words of context in search results
tts_voice: af_heart # Default TTS voice
tts_speed: 1.0 # TTS speed multiplier
disabled_tools: [] # Hide tools from MCP clients
Per-call arguments always override config. No config file is needed; all values have sensible defaults.
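The precedence described above (defaults, then the config file, then per-call arguments) can be sketched as a simple layered merge. The default values are copied from the example config; `resolve_settings` is a hypothetical name, and treating `None` as "argument not provided" is an assumption of this sketch.

```python
from typing import Optional

DEFAULTS = {
    "model_size": "tiny",
    "output_dir": "~/Downloads",
    "notes_output_dir": "~/Desktop",
    "clip_padding": 15,
    "context_words": 25,
    "tts_voice": "af_heart",
    "tts_speed": 1.0,
    "disabled_tools": [],
}


def resolve_settings(file_config: Optional[dict] = None, **call_args) -> dict:
    """Merge defaults, then config-file values, then per-call arguments."""
    settings = dict(DEFAULTS)
    settings.update(file_config or {})
    # Arguments left as None count as "not provided" and do not override.
    settings.update({k: v for k, v in call_args.items() if v is not None})
    return settings
```

For example, `resolve_settings({"model_size": "small"}, model_size="medium")` yields `medium` for that key, while every unspecified key keeps its default.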
<br />
Web UI
Local web interface. All processing runs locally: no API keys, and no data leaves your machine. An internet connection is only needed when you download audio from a URL.
augent-web
Open: http://127.0.0.1:8282
Search view:
- Upload an audio file or paste a YouTube/video URL to download audio directly
- Enter keywords separated by commas
- Click SEARCH and results stream live with timestamps and context
- YouTube timestamps are automatically hyperlinked when the source is YouTube
Clip export:
- Click the film icon on any search result to create a visual region on the waveform, or drag on the waveform to select any range manually
- Nudge buttons (±1s / ±5s) on each edge for precise boundary adjustment
- Preview plays only the selected range so you hear exactly what will be exported
- Export MP4 downloads only the selected segment, not the full video
- Keyboard shortcuts: Space preview, Enter export, Esc close
Memory Explorer:
- Browse all stored transcriptions, including files transcribed via MCP or CLI. Every tool writes to the same memory.
- View full transcripts with clickable YouTube timestamps
- Delete individual transcriptions from memory
- Show Audio to reveal the source audio file in Finder
- Show Transcript to reveal the .md transcript file in Finder. Drag it into a Claude Code session to run the full MCP pipeline on a previously transcribed file.
- Share as HTML to download a self-contained, shareable transcript page
- Search across all memories by keyword to find matches across every transcription in your library
Source URL persistence: When audio is downloaded from any URL (YouTube, Twitter/X, TikTok, Instagram, SoundCloud, and 1,000+ sites), the source URL is permanently stored by file hash. Any future search or transcription of that file, even weeks later or from a different path, automatically links back to the original source. No need to re-enter the URL.
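Keying the source URL on a hash of the file's contents, rather than its path, is what lets a moved or renamed file still resolve to its origin. The sketch below shows the idea with SHA-256 and SQLite; the hash algorithm, table layout, and function names are assumptions for illustration, not Augent's actual schema.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Optional


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hash -> URL table."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS sources (hash TEXT PRIMARY KEY, url TEXT)")
    return db


def remember_source(db: sqlite3.Connection, path: Path, url: str) -> None:
    """Record where a downloaded file came from."""
    db.execute("INSERT OR REPLACE INTO sources VALUES (?, ?)", (file_hash(path), url))
    db.commit()


def lookup_source(db: sqlite3.Connection, path: Path) -> Optional[str]:
    """Find the source URL for a file, regardless of its current path or name."""
    row = db.execute("SELECT url FROM sources WHERE hash = ?", (file_hash(path),)).fetchone()
    return row[0] if row else None
```

Because the lookup depends only on file contents, any re-encoding or trimming of the audio would produce a different hash and break the link.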
<details> <summary>Web UI options</summary>
| Command | Description |
|---|---|
| augent-web | Start on port 8282 |
| augent-web --port 8585 | Custom port |
</details>
<picture> <img src="./images/webui-1.png" alt="Augent Web UI - Upload"> </picture> <picture> <img src="./images/webui-2.png" alt="Augent Web UI - Results"> </picture>
<br />
Star History
<a href="https://www.star-history.com/#AugentDevs/Augent&type=date&legend=top-left"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=AugentDevs/Augent&type=date&theme=dark&legend=top-left" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=AugentDevs/Augent&type=date&legend=top-left" /> <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=AugentDevs/Augent&type=date&legend=top-left" /> </picture> </a>
<br />
Contributing
PRs welcome. Open an issue for bugs or feature requests.
<br />
License
MIT