claude-screen-mcp

Enables Claude to capture and analyze screen content across Windows, macOS, and Linux with zero native runtime dependencies.

Let Claude see your screen. A cross-platform MCP server for Windows + macOS + Linux with OCR and smart vision-diff. Zero native runtime deps.

Anthropic's official computer-use MCP for Claude Code is macOS-only today. This server fills the gap for Windows + Linux — and adds two things the official one doesn't have:

🔍 OCR so Claude can read screen text without spending vision tokens
📊 Smart vision-diff so 24/7 monitoring stays economical (skip frames that didn't change)

Quick start

# from source (until npm publish)
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build

# register with Claude Code
claude mcp add screen -- node "$(pwd)/dist/index.js"

# restart Claude Code, then ask:
# "Take a screenshot and tell me what's on my screen."
# "OCR my screen and tell me if there's an error message anywhere."
# "Watch my screen and ping me when the build finishes."

Tools (10 total)

Tool	Since	What it does
`screenshot`	v0.1	Capture full display, auto-resize for vision-token efficiency
`screenshot_region`	v0.1	Capture an `(x, y, w, h)` region — way cheaper than full
`list_displays`	v0.1	Enumerate connected monitors
`list_windows`	v0.1	List visible windows with optional title filter
`read_screen_text`	v0.2	OCR full screen or region (10-100× cheaper than vision)
`find_text_on_screen`	v0.2	Search OCR'd text, return matching lines + bboxes
`screenshot_if_changed`	v0.3	Capture only when perceptual hash distance ≥ threshold
`get_screen_diff`	v0.3	Distance-only diff — no image returned
`wait_for_change`	v0.4	Long-poll until the screen changes, then return one keyframe
`record_screen`	v0.4	Capture N seconds at low fps and return deduplicated keyframes

All 10 tools work the same way on Windows (PowerShell + System.Drawing), macOS (screencapture + osascript), and Linux (grim / scrot / import + wmctrl).
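
The "perceptual hash distance ≥ threshold" check behind `screenshot_if_changed` and `get_screen_diff` can be pictured with a difference hash (dHash) and a Hamming distance. What follows is a minimal sketch, not the server's code. It assumes the frame has already been downscaled to a 9×8 grayscale grid and stores the 64-bit hash as an array of 0/1 values; the names `dHash`, `hammingDistance`, and `hasChanged` are hypothetical.

```typescript
// Sketch only: assumes a 9x8 grayscale frame (row-major, values 0-255).
// The real server's resize step and hash layout are not documented here.
const HASH_W = 9; // 9 columns give 8 left/right comparisons per row
const HASH_H = 8;

function dHash(gray: Uint8Array): Uint8Array {
  if (gray.length !== HASH_W * HASH_H) {
    throw new Error(`expected ${HASH_W * HASH_H} pixels, got ${gray.length}`);
  }
  const bits = new Uint8Array((HASH_W - 1) * HASH_H);
  let i = 0;
  for (let y = 0; y < HASH_H; y++) {
    for (let x = 0; x < HASH_W - 1; x++) {
      const left = gray[y * HASH_W + x];
      const right = gray[y * HASH_W + x + 1];
      bits[i++] = left > right ? 1 : 0; // 1 when brightness drops to the right
    }
  }
  return bits;
}

// Number of bit positions where the two hashes disagree.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) count++;
  }
  return count;
}

// A frame counts as changed when the distance reaches the threshold.
function hasChanged(prev: Uint8Array, next: Uint8Array, threshold: number): boolean {
  return hammingDistance(prev, next) >= threshold;
}
```

With a 64-bit hash the distance ranges from 0 to 64, so a threshold of 12 (the value used in the examples in this README) ignores small changes such as a blinking cursor while still catching a redrawn window.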

Use cases

1. Debug what you see — "Why is my React app not rendering? Look at the screen." → screenshot → Claude sees the error overlay → suggests fix.

2. Find something specific without burning vision tokens — "Is there an error message anywhere on my screen?" → find_text_on_screen("error") returns matching line + bbox → Claude calls screenshot_region on just that bbox.

3. Watch-while-task — "Ping me when this build finishes." → wait_for_change(timeoutMs=300000, threshold=12) — server blocks until the screen actually changes (or 5 min elapses), so the model only spends a turn when something happens. For longer watches, loop screenshot_if_changed(threshold=12) every 30s.

4. Show me what just happened — "I saw something flash by, replay the last 15 seconds." → record_screen(durationMs=15000, targetFps=2, maxFrames=6) returns up to 6 deduplicated keyframes covering that period in a single tool result — like rewinding a clip without storing video.

5. Read what's on screen, not look at it — "What does the current GitHub PR description say?" → read_screen_text returns plain text → 10-100× fewer tokens than vision.
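
Use case 2 hands a bounding box from `find_text_on_screen` to `screenshot_region`. A rough sketch of that conversion is below. The bbox shape (`x0`, `y0`, `x1`, `y1`) follows the Tesseract convention and is an assumption about what the tool returns; `bboxToRegion` and the padding parameter are hypothetical helpers, not part of the server.

```typescript
// Assumed bbox shape (Tesseract-style corners); the server's actual output may differ.
interface BBox { x0: number; y0: number; x1: number; y1: number; }
// Matches the (x, y, w, h) region described for screenshot_region.
interface Region { x: number; y: number; w: number; h: number; }

// Pad the bbox so surrounding context is visible, then clamp to the screen.
function bboxToRegion(bbox: BBox, padding: number, screenW: number, screenH: number): Region {
  const x = Math.max(0, Math.floor(bbox.x0 - padding));
  const y = Math.max(0, Math.floor(bbox.y0 - padding));
  const right = Math.min(screenW, Math.ceil(bbox.x1 + padding));
  const bottom = Math.min(screenH, Math.ceil(bbox.y1 + padding));
  // Guarantee at least a 1x1 region even for degenerate input.
  return { x, y, w: Math.max(1, right - x), h: Math.max(1, bottom - y) };
}
```

A little padding helps because an error message is usually easier to interpret with the dialog or log lines around it, and the region is still far smaller than a full-screen capture.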

Why this exists

Anthropic's official Claude Code computer-use MCP server (v2.1.85+) is macOS-only as of May 2026. Windows and Linux users have no first-party way to give Claude vision into their desktop.

This project fills the gap with three deliberate constraints:

Zero native runtime deps — uses each OS's built-in screenshot tooling (PowerShell + System.Drawing on Win, screencapture on Mac, grim/scrot/import on Linux). No node-gyp, no postinstall flakiness, no platform-specific binaries to bundle.
Single responsibility — only screen capture (read-only). Keyboard / mouse control belongs in a separate server (different threat model). This means it can be safely autostarted in any Claude session without granting input control.
Token-aware by design — auto-resize to maxEdge=1600, JPEG/WebP support, region capture, OCR (skip vision entirely for text), and perceptual-hash diff (skip frames that didn't change).
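
To make the auto-resize constraint concrete, here is a minimal sketch of fitting a capture inside `maxEdge` while keeping the aspect ratio. Only the default of 1600 comes from this README; the function name `fitToMaxEdge` and the rounding behavior are assumptions.

```typescript
interface Size { width: number; height: number; }

// Scale down so the longest edge is at most maxEdge; never scale up.
function fitToMaxEdge(width: number, height: number, maxEdge: number = 1600): Size {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) {
    return { width, height }; // already small enough
  }
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

Under these assumptions a 3840×2160 display comes out at 1600×900, which is roughly a sixth of the original pixel count.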

Quality bar

Every release was reviewed by 3 specialized agents (code quality + silent-failure-hunter + security auditor) before tagging. Across v0.1 → v0.3, the audits caught 16 P0 issues that were fixed before any tag was pushed:

v0.1: PowerShell -EncodedCommand BOM / Mac+Linux list_displays returning fake data / tool errors swallowing stderr / displayId argument injection / region OOM / output byte caps
v0.2: SCREEN_MCP_OCR_LANGS supply-chain injection (allowlist enforcement) / OCR worker timeout (was unbounded) / no-match token bomb / structured OCR diagnostics / SIGTERM handler
v0.3: cache size cap + LRU + 24h stale TTL / dHash channel assert (silent monitoring failure prevention) / cross-tool cache pollution fix / CompareResult.reason to distinguish first-call from real change
v0.4: Windows window-title mojibake (PowerShell OEM codepage → UTF-8) / Tesseract v6+ output schema (blocks: true required for line bboxes; without it find_text_on_screen silently returned 0 matches) / get_screen_diff misleading above_threshold reason / two new tools (wait_for_change, record_screen) for real-time-ish workflows
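
The v0.3 cache fixes (size cap, LRU eviction, stale TTL) describe a familiar pattern. The sketch below shows one way to build it on a `Map`, which iterates in insertion order. It is an illustration under stated assumptions, not the project's implementation: `createHashCache` is a hypothetical name, and namespacing keys per tool (for example `"screenshot_if_changed:display-1"`) is only a guess at how cross-tool pollution might be avoided.

```typescript
interface CacheEntry<V> { value: V; storedAt: number; }

// LRU cache with a size cap and a staleness TTL. Time is passed in for testability.
function createHashCache<V>(maxEntries: number, ttlMs: number) {
  const map = new Map<string, CacheEntry<V>>();
  return {
    get(key: string, now: number): V | undefined {
      const entry = map.get(key);
      if (entry === undefined) return undefined;
      if (now - entry.storedAt > ttlMs) {
        map.delete(key); // stale: drop it so a fresh baseline is taken
        return undefined;
      }
      map.delete(key); // re-insert to mark as most recently used
      map.set(key, entry);
      return entry.value;
    },
    set(key: string, value: V, now: number): void {
      map.delete(key);
      map.set(key, { value, storedAt: now });
      while (map.size > maxEntries) {
        const oldest = map.keys().next().value; // first key is least recently used
        if (oldest === undefined) break;
        map.delete(oldest);
      }
    },
    size(): number {
      return map.size;
    },
  };
}
```

For the 24h TTL mentioned above, `ttlMs` would be `24 * 60 * 60 * 1000`.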

See the commit log for the full audit trail.

Configuration

Environment variables:

Var	Default	Purpose
`SCREEN_MCP_LOG_LEVEL`	`info`	`debug` / `info` / `warn` / `error`. Logs go to stderr.
`SCREEN_MCP_OCR_LANGS`	`eng+chi_sim`	Plus-separated tesseract codes. Allowlist enforced to prevent supply-chain attacks. Allowed: `eng`, `chi_sim`, `chi_tra`, `jpn`, `kor`, `fra`, `deu`, `spa`, `rus`, `ita`, `por`, `ara`, `nld`, `tur`, `vie`, `tha`, `hin`, `ben`, `ukr`.

First OCR call downloads ~40 MB of language models from cdn.jsdelivr.net. Subsequent calls reuse the cached worker.
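
Allowlist enforcement for `SCREEN_MCP_OCR_LANGS` might look like the sketch below. The language codes and the default are copied from the table above; the function name `parseOcrLangs` and the choice to throw on an unknown code (rather than, say, falling back to the default) are assumptions.

```typescript
// Codes copied from the configuration table in this README.
const ALLOWED_LANGS = new Set<string>([
  "eng", "chi_sim", "chi_tra", "jpn", "kor", "fra", "deu", "spa", "rus", "ita",
  "por", "ara", "nld", "tur", "vie", "tha", "hin", "ben", "ukr",
]);
const DEFAULT_LANGS = "eng+chi_sim";

// Parse a plus-separated list and reject anything outside the allowlist,
// so the value can never steer a model download toward an arbitrary path.
function parseOcrLangs(raw: string | undefined): string[] {
  const value = raw === undefined || raw.trim() === "" ? DEFAULT_LANGS : raw;
  const langs = value.split("+").map((s) => s.trim()).filter((s) => s.length > 0);
  const rejected = langs.filter((lang) => !ALLOWED_LANGS.has(lang));
  if (rejected.length > 0) {
    throw new Error(`unsupported OCR language(s): ${rejected.join(", ")}`);
  }
  return Array.from(new Set(langs)); // de-duplicate, keep first-seen order
}
```

In a Node process this would be called as `parseOcrLangs(process.env.SCREEN_MCP_OCR_LANGS)`.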

Platform support

Platform	Capture	Region	Displays	Window list	OCR	Vision-diff
Windows ≥ 10	✅ tested	✅	✅ multi-display	✅	✅	✅
macOS ≥ 11	✅ code	✅	🟡 stub (single only)	✅	✅	✅
Linux (X11 + Wayland)	✅ code	✅	🟡 stub (single only)	🟡 needs `wmctrl`	✅	✅

Windows is the maintainer's primary platform and has end-to-end test coverage. macOS / Linux paths are written and CI-built but not yet end-to-end tested by the maintainer — PRs and issue reports very welcome.

Security & privacy

The server runs entirely locally. No screenshot data leaves your machine via this server. (Whatever LLM client connects controls where the image goes — that's the API call you authorized when registering the connector.)
OCR text is untrusted input. Anything visible on your screen — notifications, web pages, chat windows, ads — gets passed to the LLM as a tool result. A malicious actor controlling something on your screen could embed prompt-injection content. Tool descriptions and output delimiters (<screen_ocr>...</screen_ocr>) flag this clearly so downstream models can be guided to distrust.
Use screenshot_region when you don't need the whole screen.
Use read_screen_text instead of screenshot when you only need text — vastly fewer tokens and you're not exposing other windows that happen to be open.
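
One way to produce the `<screen_ocr>...</screen_ocr>` delimiters is sketched below. The delimiter name comes from this README; stripping delimiter-like tags from the OCR'd text is an added assumption (so on-screen content cannot close the wrapper early), and `wrapOcrText` is a hypothetical name.

```typescript
// Wrap untrusted OCR text in delimiters so the model can tell it apart
// from instructions. Tag-like sequences inside the text are neutralized first.
function wrapOcrText(text: string): string {
  const safe = text.replace(/<\s*\/?\s*screen_ocr\s*>/gi, "[screen_ocr tag removed]");
  return `<screen_ocr>\n${safe}\n</screen_ocr>`;
}
```

Delimiters reduce the risk but do not remove it: the connected model still has to be told to treat everything inside the wrapper as data.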

Development

git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
node tests/e2e-wire.mjs    # spawn server + drive JSON-RPC + verify all 8 tools
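
For anyone writing their own wire test, this is roughly what a `tools/call` request looks like on the MCP stdio transport (one JSON-RPC message per line). It is a sketch of the protocol shape only, not a copy of `tests/e2e-wire.mjs`; the helper names are hypothetical, and a real session must complete the `initialize` handshake before calling tools.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Build an MCP tools/call request for the given tool name and arguments.
function toolCallRequest(id: number, name: string, args: Record<string, unknown> = {}): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport is newline-delimited: one message per line.
function frameMessage(msg: JsonRpcRequest): string {
  return JSON.stringify(msg) + "\n";
}
```

Writing `frameMessage(toolCallRequest(2, "list_displays"))` to the server's stdin and reading one line back from stdout is enough to exercise a tool end to end.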

Roadmap

v0.5 — screenshot_window(title) precisely scoped to a window's bounds; macOS multi-display enumeration via system_profiler; Linux multi-display via xrandr / wlr-randr; optional vendored tesseract models (SCREEN_MCP_OCR_LANG_PATH) for offline / air-gapped use
v1.0 — first-class MCPB bundle for one-click install via Claude Desktop

Why "real-time video" isn't a tool

MCP is request-response and each tool call costs an LLM turn (~1–3 s end-to-end). 24 fps streaming is physically impossible at that latency. Three substitutes cover the real use cases:

wait_for_change — like a human watching the screen and only saying something when it changes
record_screen — like rewinding a short clip with the boring frames cut out
screenshot_if_changed in a loop — for sustained polling under your own pacing
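
The decision logic shared by these three substitutes can be sketched as a small state machine: remember a baseline, compare each new frame's hash against it, and stop on change or timeout. This is an illustration, not the server's code. It assumes each frame has already been reduced to a bit-array hash, and the names `createChangeWatcher` and `countDifferingBits` are hypothetical.

```typescript
type WatchState = "baseline" | "unchanged" | "changed" | "timed_out";

// Count positions where two equal-length bit arrays differ.
function countDifferingBits(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) count++;
  }
  return count;
}

// Returns an observe() function; the caller feeds it one hash per poll.
// Time is passed in explicitly so the logic stays deterministic.
function createChangeWatcher(threshold: number, timeoutMs: number, startMs: number) {
  let baseline: Uint8Array | null = null;
  const deadline = startMs + timeoutMs;
  return function observe(hash: Uint8Array, nowMs: number): WatchState {
    if (baseline === null) {
      baseline = hash; // first call only records the reference frame
      return "baseline";
    }
    if (countDifferingBits(baseline, hash) >= threshold) return "changed";
    return nowMs >= deadline ? "timed_out" : "unchanged";
  };
}
```

Reporting the first call as `"baseline"` rather than `"changed"` mirrors the distinction the v0.3 changelog draws between a first call and a real change. A polling loop would call `observe` at its own pace and return a keyframe only on `"changed"`.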

Contributing

PRs especially welcome for:

macOS multi-display enumeration (system_profiler SPDisplaysDataType -json parsing)
Linux per-output capture (grim -o, scrot --screen)
screenshot_window for v0.5
Performance regressions if you find any

See CONTRIBUTING.md (TODO).

Sibling projects

Other small, single-author harnesses I publish under @lfzds4399-cpu — same MIT, same opinionated taste:

Repo	One line
ai-council	Multi-voter consensus framework — disagreement blocks instead of being averaged away
domain-harness	Automated domain investing — discovery → AI Council valuation → registration → resale, with hard budget walls
methods-harness	SymPy-verified bilingual lesson pipeline for high-school calculus — one CLI re-renders everything
voice2ai	Hands-free dictation for Windows — push-to-talk into VS Code / Cursor / WeChat / browsers, 4 STT providers

If claude-screen-mcp is useful, ⭐ the repo — it's the cheapest signal and it actually moves the needle.

License

MIT — see LICENSE.

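TL;DR (Chinese summary, translated)

Let Claude see your screen. An MCP server for Windows, macOS, and Linux with zero native dependencies.

It fills the gap left by Anthropic's official computer-use MCP being macOS-only, and adds OCR (10-100× fewer tokens than vision) plus smart vision-diff (which makes 24/7 monitoring affordable in tokens).

The eight v0.1 to v0.3 tools (screenshot / region / list displays / list windows / OCR / find text / smart screenshot / view changes) behave consistently across platforms. Every release passed a joint review by 3 agents (code quality + silent failure + security), and 16 P0 issues were fixed before release.

git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp && npm install && npm run build
claude mcp add screen -- node "$(pwd)/dist/index.js"
# Restart Claude Code, then say "Take a screenshot and show me my screen"

Chinese OCR is enabled by default (eng+chi_sim); no extra configuration is needed.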
