claude-screen-mcp
Enables Claude to capture and analyze screen content across Windows, macOS, and Linux with zero native runtime dependencies.
Let Claude see your screen. A cross-platform MCP server for Windows + macOS + Linux with OCR and smart vision-diff. Zero native runtime deps.
Anthropic's official computer-use MCP for Claude Code is macOS-only today. This server fills the gap for Windows + Linux — and adds two things the official one doesn't have:
- 🔍 OCR so Claude can read screen text without spending vision tokens
- 📊 Smart vision-diff so 24/7 monitoring stays economical (skip frames that didn't change)
Quick start
# from source (until npm publish)
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
# register with Claude Code
claude mcp add screen -- node "$(pwd)/dist/index.js"
# restart Claude Code, then ask:
# "Take a screenshot and tell me what's on my screen."
# "OCR my screen and tell me if there's an error message anywhere."
# "Watch my screen and ping me when the build finishes."
Tools (10 total)
| Tool | Since | What it does |
|---|---|---|
| screenshot | v0.1 | Capture full display, auto-resize for vision-token efficiency |
| screenshot_region | v0.1 | Capture an (x, y, w, h) region; far cheaper than a full capture |
| list_displays | v0.1 | Enumerate connected monitors |
| list_windows | v0.1 | List visible windows with optional title filter |
| read_screen_text | v0.2 | OCR full screen or region (10-100× cheaper than vision) |
| find_text_on_screen | v0.2 | Search OCR'd text, return matching lines + bboxes |
| screenshot_if_changed | v0.3 | Capture only when perceptual hash distance ≥ threshold |
| get_screen_diff | v0.3 | Distance-only diff; no image returned |
| wait_for_change | v0.4 | Long-poll until the screen changes, then return one keyframe |
| record_screen | v0.4 | Capture N seconds at low fps and return deduplicated keyframes |
All 10 tools work the same way on Windows (PowerShell + System.Drawing), macOS (screencapture + osascript), and Linux (grim / scrot / import + wmctrl).
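The change-detection tools compare frames by perceptual-hash distance. As a rough illustration of the idea, here is a minimal difference-hash (dHash) sketch with Hamming distance. It assumes the frame has already been downscaled to a 9×8 grayscale grid; the function names, grid size, and bit layout are illustrative assumptions, not the server's actual code.

```typescript
// Illustrative dHash sketch; names and the 9x8 grid size are assumptions.
// Input: grayscale pixels, row-major, 9 wide by 8 high (72 values, 0-255).
function dHash(gray: number[], width = 9, height = 8): boolean[] {
  if (gray.length !== width * height) {
    throw new Error(`expected ${width * height} pixels, got ${gray.length}`);
  }
  const bits: boolean[] = [];
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width - 1; x++) {
      // One bit per horizontally adjacent pair: is the left pixel brighter?
      bits.push(gray[y * width + x] > gray[y * width + x + 1]);
    }
  }
  return bits; // 64 bits for a 9x8 grid
}

// Hamming distance: number of bit positions where the two hashes differ.
function hammingDistance(a: boolean[], b: boolean[]): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) d++;
  }
  return d;
}

// A frame counts as "changed" when the distance reaches the threshold.
function hasChanged(prev: boolean[], next: boolean[], threshold: number): boolean {
  return hammingDistance(prev, next) >= threshold;
}
```

With a 64-bit hash, a threshold such as 12 means roughly a fifth of the brightness gradients must flip before a frame is treated as new, which is why small cursor blinks tend not to trigger a capture.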
Use cases
1. Debug what you see — "Why is my React app not rendering? Look at the screen."
→ screenshot → Claude sees the error overlay → suggests fix.
2. Find something specific without burning vision tokens — "Is there an error message anywhere on my screen?"
→ find_text_on_screen("error") returns matching line + bbox → Claude calls screenshot_region on just that bbox.
3. Watch-while-task — "Ping me when this build finishes."
→ wait_for_change(timeoutMs=300000, threshold=12) — server blocks until the screen actually changes (or 5 min elapses), so the model only spends a turn when something happens. For longer watches, loop screenshot_if_changed(threshold=12) every 30s.
4. Show me what just happened — "I saw something flash by, replay the last 15 seconds."
→ record_screen(durationMs=15000, targetFps=2, maxFrames=6) returns up to 6 deduplicated keyframes covering that period in a single tool result — like rewinding a clip without storing video.
5. Read what's on screen, not look at it — "What does the current GitHub PR description say?"
→ read_screen_text returns plain text → 10-100× fewer tokens than vision.
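Use case 2 hands a bbox from find_text_on_screen to screenshot_region. A small sketch of that hand-off, which pads the match for context and clamps it to the screen. The Box field names and the 40-pixel default padding are assumptions for illustration; the real tool output may use different names.

```typescript
// Hypothetical shape: the real tool's bbox field names may differ.
interface Box {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Grow a match bbox by `pad` pixels on each side, clamped to the screen bounds,
// so the follow-up region capture shows some surrounding context.
function paddedRegion(match: Box, screenW: number, screenH: number, pad = 40): Box {
  const x0 = Math.max(0, match.x - pad);
  const y0 = Math.max(0, match.y - pad);
  const x1 = Math.min(screenW, match.x + match.w + pad);
  const y1 = Math.min(screenH, match.y + match.h + pad);
  return { x: x0, y: y0, w: Math.max(0, x1 - x0), h: Math.max(0, y1 - y0) };
}
```

Clamping matters near screen edges: an out-of-bounds region is a likely source of capture errors, so it is cheaper to correct it before the tool call than to spend a turn on a failure.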
Why this exists
Anthropic's official Claude Code computer-use MCP server (v2.1.85+) is macOS-only as of May 2026. Windows and Linux users have no first-party way to give Claude vision into their desktop.
This project fills the gap with three deliberate constraints:
- Zero native runtime deps: uses each OS's built-in screenshot tooling (PowerShell + System.Drawing on Windows, screencapture on macOS, grim / scrot / import on Linux). No node-gyp, no postinstall flakiness, no platform-specific binaries to bundle.
- Single responsibility: only screen capture (read-only). Keyboard / mouse control belongs in a separate server (different threat model). This means it can be safely autostarted in any Claude session without granting input control.
- Token-aware by design: auto-resize to maxEdge=1600, JPEG/WebP support, region capture, OCR (skip vision entirely for text), and perceptual-hash diff (skip frames that didn't change).
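To make the maxEdge=1600 auto-resize concrete, here is one plausible way to compute the target dimensions. This is a sketch under assumptions: the function name, the rounding rule, and the choice never to upscale are illustrative and may not match the server's behavior.

```typescript
// Sketch only: rounding and the never-upscale rule are assumptions.
function fitToMaxEdge(
  width: number,
  height: number,
  maxEdge = 1600,
): { width: number; height: number } {
  const longest = Math.max(width, height);
  // Small captures pass through untouched; upscaling would only add tokens.
  if (longest <= maxEdge) return { width, height };
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

Under these assumptions a 3840×2160 capture shrinks to 1600×900, while a 1280×720 capture is returned as-is.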
Quality bar
Every release was reviewed by 3 specialized agents (code quality + silent-failure-hunter + security auditor) before tagging. Across v0.1 → v0.3, the audits caught 16 P0 issues that were fixed before any tag was pushed:
- v0.1: PowerShell -EncodedCommand BOM / Mac+Linux list_displays returning fake data / tool errors swallowing stderr / displayId argument injection / region OOM / output byte caps
- v0.2: SCREEN_MCP_OCR_LANGS supply-chain injection (allowlist enforcement) / OCR worker timeout (was unbounded) / no-match token bomb / structured OCR diagnostics / SIGTERM handler
- v0.3: cache size cap + LRU + 24h stale TTL / dHash channel assert (silent monitoring failure prevention) / cross-tool cache pollution fix / CompareResult.reason to distinguish first-call from real change
- v0.4: Windows window-title mojibake (PowerShell OEM codepage → UTF-8) / Tesseract v6+ output schema (blocks: true required for line bboxes; without it find_text_on_screen silently returned 0 matches) / get_screen_diff misleading above_threshold reason / two new tools (wait_for_change, record_screen) for real-time-ish workflows
See the commit log for the full audit trail.
Configuration
Environment variables:
| Var | Default | Purpose |
|---|---|---|
| SCREEN_MCP_LOG_LEVEL | info | debug / info / warn / error. Logs go to stderr. |
| SCREEN_MCP_OCR_LANGS | eng+chi_sim | Plus-separated tesseract codes. Allowlist enforced to prevent supply-chain attacks. Allowed: eng, chi_sim, chi_tra, jpn, kor, fra, deu, spa, rus, ita, por, ara, nld, tur, vie, tha, hin, ben, ukr. |
First OCR call downloads ~40 MB of language models from cdn.jsdelivr.net. Subsequent calls reuse the cached worker.
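The allowlist check on SCREEN_MCP_OCR_LANGS could look roughly like the sketch below. The language codes are the ones listed in this section; the function name, the fallback handling, and the decision to reject the whole value on any unknown code are assumptions, not the server's confirmed behavior.

```typescript
// Codes taken from the Configuration section; the parsing rules are assumptions.
const ALLOWED_OCR_LANGS = new Set<string>([
  "eng", "chi_sim", "chi_tra", "jpn", "kor", "fra", "deu", "spa", "rus", "ita",
  "por", "ara", "nld", "tur", "vie", "tha", "hin", "ben", "ukr",
]);

// Parse a plus-separated value such as "eng+chi_sim" and reject anything
// outside the allowlist, so the variable cannot name an arbitrary model file.
function parseOcrLangs(raw: string | undefined, fallback = "eng+chi_sim"): string[] {
  const value = raw !== undefined && raw.trim() !== "" ? raw : fallback;
  const langs = value
    .split("+")
    .map((s) => s.trim())
    .filter((s) => s !== "");
  const rejected = langs.filter((l) => !ALLOWED_OCR_LANGS.has(l));
  if (rejected.length > 0) {
    throw new Error(`unsupported OCR language(s): ${rejected.join(", ")}`);
  }
  return Array.from(new Set(langs)); // drop duplicates, keep first-seen order
}
```

Failing closed on an unknown code, rather than silently dropping it, keeps a typo or a hostile value from quietly changing which models get downloaded.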
Platform support
| Platform | Capture | Region | Displays | Windows | OCR | Vision-diff |
|---|---|---|---|---|---|---|
| Windows ≥ 10 | ✅ tested | ✅ | ✅ multi-display | ✅ | ✅ | ✅ |
| macOS ≥ 11 | ✅ code | ✅ | 🟡 stub (single only) | ✅ | ✅ | ✅ |
| Linux (X11 + Wayland) | ✅ code | ✅ | 🟡 stub (single only) | 🟡 needs wmctrl | ✅ | ✅ |
Windows is the maintainer's primary platform and has end-to-end test coverage. macOS / Linux paths are written and CI-built but not yet end-to-end tested by the maintainer — PRs and issue reports very welcome.
Security & privacy
- The server runs entirely locally. No screenshot data leaves your machine via this server. (Whatever LLM client connects controls where the image goes — that's the API call you authorized when registering the connector.)
- OCR text is untrusted input. Anything visible on your screen (notifications, web pages, chat windows, ads) gets passed to the LLM as a tool result. A malicious actor controlling something on your screen could embed prompt-injection content. Tool descriptions and output delimiters (<screen_ocr>...</screen_ocr>) flag this clearly so downstream models can be guided to distrust it.
- Use screenshot_region when you don't need the whole screen.
- Use read_screen_text instead of screenshot when you only need text: it costs vastly fewer tokens, and you're not exposing other windows that happen to be open.
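One way to apply the delimiter idea is sketched below. Only the <screen_ocr> tag name comes from this README; the function name and the rule for neutralizing delimiter-like text inside the OCR output are assumptions about how such wrapping might be hardened.

```typescript
// Sketch: the escaping rule is an assumption; only the tag name is from the README.
function wrapOcrText(text: string): string {
  // Neutralize delimiter-like sequences inside the OCR text so on-screen
  // content cannot close the block early and place text outside it.
  const safe = text.replace(/<\/?screen_ocr>/gi, "[screen_ocr tag removed]");
  return `<screen_ocr>\n${safe}\n</screen_ocr>`;
}
```

Delimiters alone do not stop prompt injection; they only give the downstream model a clear boundary between instructions and untrusted screen content.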
Development
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
node tests/e2e-wire.mjs # spawn server + drive JSON-RPC + verify all 10 tools
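For anyone writing a similar wire test, the sketch below shows the general shape of an MCP tool invocation: a JSON-RPC 2.0 request with method tools/call, sent as one JSON message per line over stdio. The helper names are hypothetical, and the argument names (x, y, w, h) are assumed from the tool table rather than taken from the server's schema. A real session also needs the initialize handshake before any tool call.

```typescript
// MCP tool invocations are JSON-RPC 2.0 requests with method "tools/call".
// Helper names are hypothetical; argument names are assumed from the tool table.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function toolCall(
  id: number,
  name: string,
  args: Record<string, unknown> = {},
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport carries one JSON message per line, with no embedded newlines.
function frameMessage(req: ToolCallRequest): string {
  return JSON.stringify(req) + "\n";
}

const exampleLine = frameMessage(
  toolCall(1, "screenshot_region", { x: 0, y: 0, w: 800, h: 600 }),
);
```

Writing exampleLine to the server's stdin and reading one line back from stdout is enough to exercise a single tool once the handshake is done.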
Roadmap
- v0.5: screenshot_window(title) precisely scoped to a window's bounds; macOS multi-display enumeration via system_profiler; Linux multi-display via xrandr / wlr-randr; optional vendored tesseract models (SCREEN_MCP_OCR_LANG_PATH) for offline / air-gapped use
- v1.0: first-class MCPB bundle for one-click install via Claude Desktop
Why "real-time video" isn't a tool
MCP is request-response and each tool call costs an LLM turn (~1–3 s end-to-end). 24 fps streaming is physically impossible at that latency. Three substitutes cover the real use cases:
- wait_for_change: like a human watching the screen and only saying something when it changes
- record_screen: like rewinding a short clip with the boring frames cut out
- screenshot_if_changed in a loop: for sustained polling under your own pacing
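The "boring frames cut out" idea behind record_screen can be sketched as a simple filter: keep a frame only when its hash differs enough from the last kept frame, and stop at maxFrames. The Frame shape, the function names, and the compare-to-last-kept rule are assumptions for illustration, not the server's confirmed algorithm.

```typescript
// Hypothetical frame record: capture time plus a perceptual hash as bits.
interface Frame {
  tMs: number;
  hash: boolean[];
}

// Count the bit positions where two equal-length hashes differ.
function hashDistance(a: boolean[], b: boolean[]): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) d++;
  }
  return d;
}

// Keep the first frame, then only frames that differ from the last kept
// frame by at least `threshold` bits, up to `maxFrames` keyframes.
function selectKeyframes(frames: Frame[], threshold: number, maxFrames: number): Frame[] {
  const kept: Frame[] = [];
  for (const f of frames) {
    if (kept.length >= maxFrames) break;
    const last = kept[kept.length - 1];
    if (last === undefined || hashDistance(last.hash, f.hash) >= threshold) {
      kept.push(f);
    }
  }
  return kept;
}
```

Comparing against the last kept frame, rather than the immediately preceding one, means a slow drift still produces a keyframe once it accumulates past the threshold.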
Contributing
PRs especially welcome for:
- macOS multi-display enumeration (system_profiler SPDisplaysDataType -json parsing)
- Linux per-output capture (grim -o, scrot --screen)
- screenshot_window for v0.5
- Fixes for any performance regressions you find
See CONTRIBUTING.md (TODO).
Sibling projects
Other small, single-author harnesses I publish under @lfzds4399-cpu — same MIT, same opinionated taste:
| Repo | One line |
|---|---|
| ai-council | Multi-voter consensus framework — disagreement blocks instead of being averaged away |
| domain-harness | Automated domain investing — discovery → AI Council valuation → registration → resale, with hard budget walls |
| methods-harness | SymPy-verified bilingual lesson pipeline for high-school calculus — one CLI re-renders everything |
| voice2ai | Hands-free dictation for Windows — push-to-talk into VS Code / Cursor / WeChat / browsers, 4 STT providers |
If claude-screen-mcp is useful, ⭐ the repo — it's the cheapest signal and it actually moves the needle.
License
MIT — see LICENSE.
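Chinese TL;DR
Let Claude see your screen. An MCP server for Windows, macOS, and Linux with zero native dependencies.
It fills the gap left by Anthropic's official computer-use MCP being macOS-only, and adds OCR (10-100× fewer tokens than vision) and smart vision-diff (which keeps 24/7 monitoring affordable in tokens).
10 tools (screenshot / region / list displays / list windows / OCR / find text / change-gated screenshot / diff / wait for change / record), consistent across platforms. Every release passed a joint review by 3 agents (code quality + silent failure + security), which fixed 16 P0 issues before anything shipped.
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp && npm install && npm run build
claude mcp add screen -- node "$(pwd)/dist/index.js"
# restart Claude Code, then say "Take a screenshot and show me"
Chinese OCR is enabled by default (eng+chi_sim); no extra configuration needed.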