claude-screen-mcp

Let Claude see your screen. A cross-platform MCP server for Windows + macOS + Linux with OCR and smart vision-diff. Zero native runtime deps.

Anthropic's official computer-use MCP for Claude Code is macOS-only today. This server fills the gap for Windows + Linux — and adds two things the official one doesn't have:

  • 🔍 OCR so Claude can read screen text without spending vision tokens
  • 📊 Smart vision-diff so 24/7 monitoring stays economical (skip frames that didn't change)

Quick start

# from source (until npm publish)
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build

# register with Claude Code
claude mcp add screen -- node "$(pwd)/dist/index.js"

# restart Claude Code, then ask:
# "Take a screenshot and tell me what's on my screen."
# "OCR my screen and tell me if there's an error message anywhere."
# "Watch my screen and ping me when the build finishes."

Tools (10 total)

| Tool | Since | What it does |
| --- | --- | --- |
| screenshot | v0.1 | Capture full display, auto-resize for vision-token efficiency |
| screenshot_region | v0.1 | Capture an (x, y, w, h) region; far cheaper than a full capture |
| list_displays | v0.1 | Enumerate connected monitors |
| list_windows | v0.1 | List visible windows with optional title filter |
| read_screen_text | v0.2 | OCR full screen or region (10-100× cheaper than vision) |
| find_text_on_screen | v0.2 | Search OCR'd text, return matching lines + bboxes |
| screenshot_if_changed | v0.3 | Capture only when perceptual hash distance ≥ threshold |
| get_screen_diff | v0.3 | Distance-only diff; no image returned |
| wait_for_change | v0.4 | Long-poll until the screen changes, then return one keyframe |
| record_screen | v0.4 | Capture N seconds at low fps and return deduplicated keyframes |

All 10 tools work the same way on Windows (PowerShell + System.Drawing), macOS (screencapture + osascript), and Linux (grim / scrot / import + wmctrl).
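
The change-detection tools compare perceptual hashes, and the audit notes further down mention dHash. As a rough illustration of that idea (not the server's actual code), the sketch below computes a difference hash from an already-downscaled grayscale grid and compares two hashes by Hamming distance. The 9×8 grid size, the function names, and the bit-array representation are assumptions made for this example.

```typescript
// Illustrative sketch only; names and data layout are hypothetical.
// Input: rows of 0-255 luma values, e.g. 8 rows x 9 columns after downscaling.
function dHash(grid: number[][]): number[] {
  const bits: number[] = [];
  for (let y = 0; y < grid.length; y++) {
    const row = grid[y];
    for (let x = 0; x + 1 < row.length; x++) {
      // One bit per horizontal neighbour pair: 1 when brightness drops left to right.
      bits.push(row[x] > row[x + 1] ? 1 : 0);
    }
  }
  return bits;
}

// Number of bit positions where the two hashes differ.
function hammingDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("hash length mismatch");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Same comparison the tool table describes: changed when distance >= threshold.
function hasChanged(prev: number[], next: number[], threshold: number): boolean {
  return hammingDistance(prev, next) >= threshold;
}
```

With a 64-bit hash, a threshold such as 12 means roughly a fifth of the bits must flip before a frame counts as changed, so small cursor blinks or clock ticks tend to fall below it.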


Use cases

1. Debug what you see. "Why is my React app not rendering? Look at the screen." screenshot → Claude sees the error overlay → suggests a fix.

2. Find something specific without burning vision tokens. "Is there an error message anywhere on my screen?" find_text_on_screen("error") returns the matching line + bbox → Claude calls screenshot_region on just that bbox.

3. Watch-while-task. "Ping me when this build finishes." wait_for_change(timeoutMs=300000, threshold=12): the server blocks until the screen actually changes (or 5 min elapses), so the model only spends a turn when something happens. For longer watches, loop screenshot_if_changed(threshold=12) every 30 s.

4. Show me what just happened. "I saw something flash by, replay the last 15 seconds." record_screen(durationMs=15000, targetFps=2, maxFrames=6) returns up to 6 deduplicated keyframes covering that period in a single tool result, like rewinding a clip without storing video.

5. Read what's on screen, not look at it. "What does the current GitHub PR description say?" read_screen_text returns plain text → 10-100× fewer tokens than vision.
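
Use case 2 hinges on turning an OCR bounding box into a region capture. The following is a minimal sketch of that step, assuming the bbox arrives as { x, y, w, h } in screen pixels; the field names actually returned by find_text_on_screen may differ, and padAndClamp is a hypothetical helper, not part of the server.

```typescript
// Hypothetical box shape; the real tool output may use different field names.
interface Box {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Grow the OCR bbox by `padding` pixels for context, then clamp it to the display
// so the resulting region never extends past the screen edges.
function padAndClamp(
  bbox: Box,
  padding: number,
  display: { width: number; height: number }
): Box {
  const left = Math.max(0, bbox.x - padding);
  const top = Math.max(0, bbox.y - padding);
  const right = Math.min(display.width, bbox.x + bbox.w + padding);
  const bottom = Math.min(display.height, bbox.y + bbox.h + padding);
  return {
    x: left,
    y: top,
    w: Math.max(0, right - left),
    h: Math.max(0, bottom - top),
  };
}
```

A little padding gives the model surrounding context (a dialog title, a line number) while still sending far fewer pixels than a full-display capture.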


Why this exists

Anthropic's official Claude Code computer-use MCP server (v2.1.85+) is macOS-only as of May 2026. Windows and Linux users have no first-party way to give Claude vision into their desktop.

This project fills the gap with three deliberate constraints:

  1. Zero native runtime deps — uses each OS's built-in screenshot tooling (PowerShell + System.Drawing on Win, screencapture on Mac, grim/scrot/import on Linux). No node-gyp, no postinstall flakiness, no platform-specific binaries to bundle.
  2. Single responsibility — only screen capture (read-only). Keyboard / mouse control belongs in a separate server (different threat model). This means it can be safely autostarted in any Claude session without granting input control.
  3. Token-aware by design — auto-resize to maxEdge=1600, JPEG/WebP support, region capture, OCR (skip vision entirely for text), and perceptual-hash diff (skip frames that didn't change).
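
To make the maxEdge=1600 constraint concrete, here is a small sketch of how an aspect-preserving downscale could be computed. fitToMaxEdge is a hypothetical name and the rounding policy is an assumption; the server may compute its target size differently.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scale so the longer edge is at most `maxEdge`, keeping the aspect ratio.
// Images that already fit are returned unchanged (never upscaled).
function fitToMaxEdge(width: number, height: number, maxEdge: number = 1600): Size {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) {
    return { width: width, height: height };
  }
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

Under these assumptions a 3840×2160 capture would shrink to 1600×900, while a 1280×720 capture would pass through untouched.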

Quality bar

Every release was reviewed by 3 specialized agents (code quality + silent-failure-hunter + security auditor) before tagging. Across v0.1 → v0.3, the audits caught 16 P0 issues, all fixed before any tag was pushed. The list below summarizes those findings together with the v0.4 changes:

  • v0.1: PowerShell -EncodedCommand BOM / Mac+Linux list_displays returning fake data / tool errors swallowing stderr / displayId argument injection / region OOM / output byte caps
  • v0.2: SCREEN_MCP_OCR_LANGS supply-chain injection (allowlist enforcement) / OCR worker timeout (was unbounded) / no-match token bomb / structured OCR diagnostics / SIGTERM handler
  • v0.3: cache size cap + LRU + 24h stale TTL / dHash channel assert (silent monitoring failure prevention) / cross-tool cache pollution fix / CompareResult.reason to distinguish first-call from real change
  • v0.4: Windows window-title mojibake (PowerShell OEM codepage → UTF-8) / Tesseract v6+ output schema (blocks: true required for line bboxes; without it find_text_on_screen silently returned 0 matches) / get_screen_diff misleading above_threshold reason / two new tools (wait_for_change, record_screen) for real-time-ish workflows

See the commit log for the full audit trail.


Configuration

Environment variables:

| Var | Default | Purpose |
| --- | --- | --- |
| SCREEN_MCP_LOG_LEVEL | info | debug / info / warn / error. Logs go to stderr. |
| SCREEN_MCP_OCR_LANGS | eng+chi_sim | Plus-separated tesseract codes. Allowlist enforced to prevent supply-chain attacks. Allowed: eng, chi_sim, chi_tra, jpn, kor, fra, deu, spa, rus, ita, por, ara, nld, tur, vie, tha, hin, ben, ukr. |

First OCR call downloads ~40 MB of language models from cdn.jsdelivr.net. Subsequent calls reuse the cached worker.
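
The allowlist check on SCREEN_MCP_OCR_LANGS could look roughly like the sketch below. The allowed codes and the eng+chi_sim default come from the table above; the function name and the choice to throw on an unknown code (rather than, say, silently dropping it) are assumptions, not the server's documented behavior.

```typescript
// Codes copied from the configuration table above.
const ALLOWED_OCR_LANGS: string[] = [
  "eng", "chi_sim", "chi_tra", "jpn", "kor", "fra", "deu", "spa", "rus", "ita",
  "por", "ara", "nld", "tur", "vie", "tha", "hin", "ben", "ukr",
];

// Hypothetical parser: split on "+", then reject anything outside the allowlist
// so an attacker-controlled value cannot steer the model download elsewhere.
function parseOcrLangs(raw: string | undefined): string[] {
  const value = raw === undefined || raw.trim() === "" ? "eng+chi_sim" : raw;
  const codes = value.split("+").map(function (code) {
    return code.trim();
  });
  for (let i = 0; i < codes.length; i++) {
    if (ALLOWED_OCR_LANGS.indexOf(codes[i]) === -1) {
      throw new Error("Unsupported OCR language code: " + JSON.stringify(codes[i]));
    }
  }
  return codes;
}
```

Matching against a fixed list, instead of pattern-checking the string, means path fragments or URLs in the variable never reach the code that fetches language models.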


Platform support

Platform Capture Region Displays Windows OCR Vision-diff
Windows ≥ 10 ✅ tested ✅ multi-display
macOS ≥ 11 ✅ code 🟡 stub (single only)
Linux (X11 + Wayland) ✅ code 🟡 stub (single only) 🟡 needs wmctrl

Windows is the maintainer's primary platform and has end-to-end test coverage. macOS / Linux paths are written and CI-built but not yet end-to-end tested by the maintainer — PRs and issue reports very welcome.


Security & privacy

  • The server runs entirely locally. No screenshot data leaves your machine via this server. (Whatever LLM client connects controls where the image goes — that's the API call you authorized when registering the connector.)
  • OCR text is untrusted input. Anything visible on your screen (notifications, web pages, chat windows, ads) gets passed to the LLM as a tool result. A malicious actor controlling something on your screen could embed prompt-injection content. Tool descriptions and output delimiters (<screen_ocr>...</screen_ocr>) flag this clearly so downstream models can be guided to distrust it.
  • Use screenshot_region when you don't need the whole screen.
  • Use read_screen_text instead of screenshot when you only need text — vastly fewer tokens and you're not exposing other windows that happen to be open.
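
As an illustration of the delimiter idea, the sketch below wraps OCR output in the <screen_ocr> tags mentioned above. Neutralizing look-alike tags inside the captured text is an extra precaution assumed for this example; whether the server does the same is not stated here, and wrapOcrText is a hypothetical name.

```typescript
// Wrap untrusted OCR text so a downstream model can tell where screen content
// begins and ends. Any delimiter look-alikes inside the text are replaced first,
// so on-screen content cannot close the block early.
function wrapOcrText(text: string): string {
  const sanitized = text.replace(/<\/?screen_ocr>/gi, "[screen_ocr]");
  return "<screen_ocr>\n" + sanitized + "\n</screen_ocr>";
}
```

Delimiters do not make the content safe on their own; they only give the consuming model a clear boundary to treat as data, not instructions.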

Development

git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
node tests/e2e-wire.mjs    # spawn server + drive JSON-RPC + verify all 10 tools
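
For readers who want to drive the server by hand, the sketch below builds the kind of JSON-RPC message an MCP client sends for a tool call over stdio. The tools/call envelope follows the MCP specification; the argument names passed to screenshot_region (x, y, w, h) are an assumption based on the tool table, and a real session also needs the initialize handshake first, which is omitted here.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Serialize one request as a single newline-terminated line, the framing used
// by the MCP stdio transport.
function buildToolCall(id: number, toolName: string, args: Record<string, unknown>): string {
  const request: ToolCallRequest = {
    jsonrpc: "2.0",
    id: id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
  return JSON.stringify(request) + "\n";
}
```

The returned string can be written to the server's stdin; the response arrives on stdout as a JSON-RPC result carrying the same id.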

Roadmap

  • v0.5: screenshot_window(title) precisely scoped to a window's bounds; macOS multi-display enumeration via system_profiler; Linux multi-display via xrandr / wlr-randr; optional vendored tesseract models (SCREEN_MCP_OCR_LANG_PATH) for offline / air-gapped use
  • v1.0: first-class MCPB bundle for one-click install via Claude Desktop

Why "real-time video" isn't a tool

MCP is request-response and each tool call costs an LLM turn (~1–3 s end-to-end), so 24 fps streaming is not achievable at that latency. Three substitutes cover the real use cases:

  • wait_for_change — like a human watching the screen and only saying something when it changes
  • record_screen — like rewinding a short clip with the boring frames cut out
  • screenshot_if_changed in a loop — for sustained polling under your own pacing
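
The "boring frames cut out" idea can be sketched as a simple deduplication pass: keep a frame only when its hash differs enough from the last frame kept, and stop at a frame cap. This is an assumed illustration of the approach, not the record_screen implementation; the Frame shape, the bit-array hash, and the function names are hypothetical.

```typescript
// Hypothetical frame record: capture time plus a perceptual hash stored as bits.
interface Frame {
  timestampMs: number;
  hashBits: number[];
}

// Count the bit positions where two equal-length hashes differ.
function bitDistance(a: number[], b: number[]): number {
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Always keep the first frame; after that, keep a frame only if it is at least
// `threshold` bits away from the most recently kept one. Stop at `maxFrames`.
function selectKeyframes(frames: Frame[], threshold: number, maxFrames: number): Frame[] {
  const kept: Frame[] = [];
  for (let i = 0; i < frames.length && kept.length < maxFrames; i++) {
    const isFirst = kept.length === 0;
    if (isFirst || bitDistance(kept[kept.length - 1].hashBits, frames[i].hashBits) >= threshold) {
      kept.push(frames[i]);
    }
  }
  return kept;
}
```

Comparing against the last kept frame, not the immediately preceding one, keeps slow drifts from slipping through as a long run of sub-threshold changes.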

Contributing

PRs especially welcome for:

  • macOS multi-display enumeration (system_profiler SPDisplaysDataType -json parsing)
  • Linux per-output capture (grim -o, scrot --screen)
  • screenshot_window for v0.5
  • Fixes for any performance regressions you find

See CONTRIBUTING.md (TODO).


Sibling projects

Other small, single-author harnesses I publish under @lfzds4399-cpu — same MIT, same opinionated taste:

| Repo | One line |
| --- | --- |
| ai-council | Multi-voter consensus framework: disagreement blocks instead of being averaged away |
| domain-harness | Automated domain investing: discovery → AI Council valuation → registration → resale, with hard budget walls |
| methods-harness | SymPy-verified bilingual lesson pipeline for high-school calculus: one CLI re-renders everything |
| voice2ai | Hands-free dictation for Windows: push-to-talk into VS Code / Cursor / WeChat / browsers, 4 STT providers |

If claude-screen-mcp is useful, ⭐ the repo — it's the cheapest signal and it actually moves the needle.


License

MIT — see LICENSE.


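TL;DR

Let Claude see your screen. An MCP server for Win/Mac/Linux with zero native dependencies.

It fills the gap left by Anthropic's official computer-use MCP being macOS-only, and adds OCR (10-100× fewer tokens than vision) plus smart vision-diff (which keeps 24/7 monitoring affordable in tokens).

10 tools (screenshot / region / list displays / list windows / OCR / find text / change-gated screenshot / diff / wait for change / record), consistent across platforms. Every release went through a joint review by 3 agents (code quality + silent failure + security), and 16 P0 issues were fixed before shipping.

git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp && npm install && npm run build
claude mcp add screen -- node "$(pwd)/dist/index.js"
# Restart Claude Code, then say "Take a screenshot and show me"

Chinese OCR is enabled by default (eng+chi_sim); no extra configuration needed.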
