screen-mcp
An MCP server that gives a model eyes and hands on a Linux Wayland desktop, enabling screenshot capture, mouse/keyboard control, OCR, and icon detection via OmniParser.
An MCP server that gives a model eyes and hands on a Linux Wayland desktop:
- Screenshots via PipeWire.
- Pointer and keyboard input via the RemoteDesktop portal.
- OCR and icon detection via RapidOCR plus an OmniParser ONNX model.
- An ambient sense layer that diffs frames, so the agent knows when something opened or when nothing changed.
- A write-through world-model cache, so a recognised screen skips OCR.
- An opt-in ack gate that blocks close-combos and destructive-keyword clicks until the caller passes a confirmation token.
Current version: 1.3.2.
Requirements
- Linux + Wayland + GNOME (the awareness layer uses a bundled GNOME Shell extension; AT-SPI is the fallback for GTK apps).
- Python 3.10+ (tested on 3.14).
- GStreamer >= 1.28 (uses `leaky-type`; the older `drop=` was removed in 1.28).
- PipeWire + `xdg-desktop-portal-gnome`.
- `wl-clipboard` (for the Unicode paste path in `screen_type`).
- A DejaVu Sans Bold font (Set-of-Marks labels; falls back to PIL's default).
Install
System packages first — see requirements.txt for the full
pacman / apt one-liners.
# Arch
sudo pacman -S python-gobject gobject-introspection \
gstreamer gst-plugins-base gst-plugins-good gst-libav \
pipewire pipewire-pulse xdg-desktop-portal-gnome \
wl-clipboard ttf-dejavu
# Python deps
pip install -r requirements.txt
Install the GNOME Shell extension (optional but recommended — gives the awareness layer reliable focused-window + window-list data):
gnome-shell-extension/window-info@local/install.sh
# then enable it
gnome-extensions enable window-info@local
Wire it into Claude Code
Add to ~/.claude.json under mcpServers:
{
  "mcp-screen": {
    "command": "python3",
    "args": ["/path/to/mcp-screen/server.py"]
  }
}
The first run triggers an xdg-desktop-portal consent dialog (pick which
monitor(s) to share). The portal returns a restore token which is persisted to
~/.config/mcp-screen/token — subsequent runs are silent.
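The token round-trip can be pictured with a minimal sketch. Only the path comes from this README; the helper names and the file permissions are assumptions for illustration, not the server's actual code.

```python
from pathlib import Path
from typing import Optional

# Path taken from this README; everything else here is hypothetical.
TOKEN_PATH = Path.home() / ".config" / "mcp-screen" / "token"


def load_restore_token(path: Path = TOKEN_PATH) -> Optional[str]:
    """Return the saved restore token, or None if consent has not been given yet."""
    try:
        token = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        return None
    return token or None


def save_restore_token(token: str, path: Path = TOKEN_PATH) -> None:
    """Persist the token so the next run can skip the consent dialog."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(token + "\n", encoding="utf-8")
    path.chmod(0o600)  # assumption: keep the token readable by the owner only
```

If the portal ever rejects a stale token, deleting the file should bring the consent dialog back on the next run.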
Tools
| Name | What it does |
|---|---|
| `screen_screenshot` | Capture the desktop. `region=[x,y,w,h]` or `monitor=N` to zoom. `annotate=true` overlays numbered Set-of-Marks and lists click coords. `use_cache=true` (with `annotate`) reuses learned elements for a known screen (skips OCR). `fresh=true` forces a current frame on a damage-driven static monitor (defeats the keepalive-resend stale read). It nudges the pointer, so the server applies it automatically only right after an unconfirmed action, not on every shot; pass it explicitly if a static-monitor read looks stale. |
| `screen_list_monitors` | Monitors (origin/size/scale), desktop bounds, focused windows. |
| `screen_move_mouse` | Move pointer to x,y (view-space default; server maps to real px). |
| `screen_click` | Click at x,y or in place. `button: left\|right\|middle`, `double: true`. |
| `screen_scroll` | Wheel scroll. `direction: up\|down\|left\|right`, `amount`: notches. |
| `screen_drag` | Press-drag from (x1,y1) to (x2,y2). |
| `screen_key` | Press a key/combo: `"Ctrl+L"`, `"Enter"`, `"Alt+Tab"`, `"F5"`. |
| `screen_type` | Type text (Unicode via `wl-copy` + Ctrl+V; ASCII via keysyms). `enter: true` presses Enter after. Keys go to the FOCUSED window, so pass `focus: "app"` or call `screen_focus` first. |
| `screen_focus` | Raise + give KEYBOARD focus to a window (app/title/id) so injected keys/clicks land in it. Uses the window-info extension's `ActivateWindow` when loaded, else the GNOME overview. |
| `screen_do` | Batched ordered actions in one call. |
| `screen_tour` | Visit several UI states and get a labeled thumbnail of each. |
| `screen_read_page` | Auto-scroll a scrollable view in one call; accumulates every interactable. |
| `screen_wait` | Block until the screen settles, then optionally screenshot. |
| `screen_session` | Recorder: `op=start\|stop\|list\|status\|replay-path`. |
| `screen_reload` | Hot-reload the server in place after edits (no `/mcp` reconnect). |
| `screen_diag` | Health dump: session/geo, cursor, grounding backends, world-model stats. |
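For orientation, this is roughly what a client sends when it invokes one of these tools. The `tools/call` envelope is standard MCP over JSON-RPC 2.0. The argument names (`x`, `y`, `button`, `double`, `key`) are read off the table above and should be treated as assumptions until checked against the server's tool schemas. `tool_call` is a hypothetical helper.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tool_call(name: str, **arguments) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request as a JSON string."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Argument names below are assumed from the tools table, not verified.
print(tool_call("screen_click", x=640, y=360, button="left", double=True))
print(tool_call("screen_key", key="Ctrl+L"))
```

In practice an MCP client such as Claude Code builds these requests itself; the sketch is only meant to show how the table maps onto the wire format.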
Every action also takes:
- `space: 'view' | 'desktop' | 'norm'` (default `view`: coords as seen in the last screenshot).
- `shot: true` to return a screenshot after the action.
- `verify: true` to warn on no-screen-change misclicks.
- `force: true` to bypass the user-takeover guard.
- `element: <id>` to click an element id returned by the last `annotate=true` shot (the server resolves exact coords; no guessing).
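One way to picture the three coordinate spaces is the sketch below. It assumes that `view` coordinates are pixels in the (possibly downscaled) last screenshot, that `norm` coordinates are 0..1 fractions of the same area, and that `desktop` coordinates are already real pixels. The `View` fields and `to_desktop` are hypothetical names; the server's own mapping may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Geometry of the last screenshot (all field names are hypothetical)."""
    origin_x: int  # desktop px of the captured area's left edge
    origin_y: int  # desktop px of the captured area's top edge
    real_w: int    # captured area width in desktop px
    real_h: int    # captured area height in desktop px
    view_w: int    # screenshot width as the model saw it
    view_h: int    # screenshot height as the model saw it


def to_desktop(x: float, y: float, space: str, view: View) -> tuple[int, int]:
    """Map a point from the given space to real desktop pixels."""
    if space == "desktop":
        return round(x), round(y)
    if space == "norm":
        fx, fy = x, y
    elif space == "view":
        fx, fy = x / view.view_w, y / view.view_h
    else:
        raise ValueError(f"unknown space: {space!r}")
    return (view.origin_x + round(fx * view.real_w),
            view.origin_y + round(fy * view.real_h))


# A 2560x1440 monitor placed at x=1920, seen as a 1280x720 screenshot.
second_monitor = View(1920, 0, 2560, 1440, 1280, 720)
print(to_desktop(640, 360, "view", second_monitor))   # (3200, 720)
print(to_desktop(0.5, 0.5, "norm", second_monitor))   # (3200, 720)
```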
Environment variables
| Variable | Effect |
|---|---|
| `MCP_SCREEN_GUARD=1` | Enable the reliability ack gate. Destructive combos (Ctrl+W, Alt+F4, cmd+q), OCR-matched destructive keywords (delete/pay/submit/...), and out-of-allowlist actions block unless the caller passes `ack=<reason>`. |
| `MCP_SCREEN_APPS="firefox,terminal"` | With guard on, restrict actions to this allowlist of focused apps. |
| `MCP_SCREEN_AUDIT_FRAMES=1` | Add pre/post frame hash + `changed_bbox` to every audit log line. ~100-500ms latency per action. |
| `MCP_SCREEN_AMBIENT=0` | Disable the ambient SENSE hint block. |
| `MCP_SCREEN_GUARD_PX=40` | Threshold for the user-takeover guard (live pointer vs last-commanded). |
| `MCP_SCREEN_CPU_THREADS=6` | ONNX intra-op thread count for OmniParser. |
| `MCP_SCREEN_MAX_EDGE=2576` | Screenshot downscale target (long edge). |
| `MCP_SCREEN_NO_FRESH=1` | Disable forced fresh-frame capture on static monitors (screenshots may then return the keepalive-resent stale frame). |
| `MCP_SCREEN_FOCUS_SETTLE_MS=150` | Delay after `screen_focus` activates a window (lets the compositor deliver keyboard focus before a following keystroke burst). |
| `MCP_SCREEN_NO_NUDGE=1` | Disable the pointer damage-nudge used to prime/refresh a static monitor's frame. |
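The ack gate described in the first two rows could look something like the sketch below. It is an illustration under assumptions: the combo and keyword sets only contain the examples this README names, and `needs_ack` and `gate` are hypothetical helpers, not the server's API.

```python
import re

# Only the examples named in this README; the real lists are presumably longer.
DESTRUCTIVE_COMBOS = {"ctrl+w", "alt+f4", "cmd+q"}
DESTRUCTIVE_WORDS = re.compile(r"\b(delete|pay|submit)\b", re.IGNORECASE)


def needs_ack(env, *, combo=None, target_text=None, focused_app=None):
    """Return a reason string if the action must be acknowledged, else None."""
    if env.get("MCP_SCREEN_GUARD") != "1":
        return None  # the gate is opt-in
    if combo and combo.lower() in DESTRUCTIVE_COMBOS:
        return f"destructive combo {combo}"
    if target_text:
        match = DESTRUCTIVE_WORDS.search(target_text)
        if match:
            return f"destructive keyword {match.group(0).lower()!r}"
    allowlist = [a.strip().lower()
                 for a in env.get("MCP_SCREEN_APPS", "").split(",") if a.strip()]
    if allowlist and focused_app is not None and focused_app.lower() not in allowlist:
        return f"app {focused_app!r} is outside the allowlist"
    return None


def gate(env, ack=None, **action):
    """Raise unless the action is harmless or the caller supplied an ack."""
    reason = needs_ack(env, **action)
    if reason and not ack:
        raise PermissionError(f"blocked: {reason}; pass ack=<reason> to proceed")
```

Passing `os.environ` as `env` would wire the sketch to the real variables; taking a plain mapping keeps it easy to test.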
Data paths
| Path | What |
|---|---|
| `~/.config/mcp-screen/token` | Portal restore token (one-time consent). |
| `~/.local/share/mcp-screen/world/map.db` | World-model SQLite cache (per-screen learned elements). |
| `~/.local/share/mcp-screen/sessions/<sid>/` | Recorder trajectories + WebP frames + `replay.html`. |
| `~/.local/state/mcp-screen/actions.jsonl` | Reliability audit log (one JSON line per action). |
| `/tmp/screen_err.txt` | Last unhandled tool traceback (dev-diagnostic only). |
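To make the audit log concrete, here is a minimal sketch of how one JSONL line with a frame hash and a changed bounding box might be produced. The record's field names, the hash choice, and the 8-bit grayscale frame layout are all assumptions; this README does not document the actual schema.

```python
import hashlib
import json
import time


def frame_hash(frame: bytes) -> str:
    """Short content hash of a raw frame."""
    return hashlib.sha256(frame).hexdigest()[:16]


def changed_bbox(before: bytes, after: bytes, width: int):
    """Bounding box [x, y, w, h] of differing pixels, or None if identical.

    Both frames are assumed to be 8-bit grayscale, row-major, same size.
    """
    if len(before) != len(after):
        raise ValueError("frames differ in size")
    xs, ys = [], []
    for i, (a, b) in enumerate(zip(before, after)):
        if a != b:
            xs.append(i % width)
            ys.append(i // width)
    if not xs:
        return None
    x0, y0 = min(xs), min(ys)
    return [x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1]


def audit_line(tool: str, before: bytes, after: bytes, width: int) -> str:
    """One JSON line per action (hypothetical field names)."""
    record = {
        "ts": time.time(),
        "tool": tool,
        "pre_hash": frame_hash(before),
        "post_hash": frame_hash(after),
        "changed_bbox": changed_bbox(before, after, width),
    }
    return json.dumps(record)
```

Reading the log back is then a matter of calling `json.loads` on each line of `actions.jsonl`.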
Dev workflow
pytest -q # 78 tests, ~0.7s, no live D-Bus needed (conftest stubs)
Edit a .py, then in the running Claude Code session:
screen_reload # re-execs the server in place (preserves the MCP connection)
On any tool exception the dispatcher writes the full traceback to
/tmp/screen_err.txt (the JSON-RPC error only carries the message); read it
when debugging crashes.
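The pattern described here is small enough to sketch. The function name, the handler signature, and the error shape are assumptions for illustration; only the file path and the "message only" behaviour come from this README.

```python
import traceback
from pathlib import Path

ERR_PATH = Path("/tmp/screen_err.txt")  # path from this README


def dispatch(handler, arguments: dict, err_path: Path = ERR_PATH) -> dict:
    """Run a tool handler; on failure, keep the full traceback on disk."""
    try:
        return {"result": handler(**arguments)}
    except Exception as exc:
        # The file holds the whole stack; the reply carries only the message.
        err_path.write_text(traceback.format_exc(), encoding="utf-8")
        # -32603 is the JSON-RPC 2.0 "Internal error" code.
        return {"error": {"code": -32603, "message": str(exc)}}
```

Because the file is overwritten on each failure, it always reflects the most recent crash.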
Ops notes (hard-won — read before touching capture/input)
- Fractional scaling: `NotifyPointerMotionAbsolute` coords are logical and local to each stream (keyed by `node_id`). Don't add a global logical origin; the portal clamps with "Invalid position". See `input.global_to_logical`.
- Cursor position: `cursor_mode=METADATA(4)` means the cursor is NOT baked into frames. PipeWire attaches a `SPA_META_Cursor` to its src pad, but `videoconvert` strips it and PyGObject can't downcast it, so `capture.py` reads it via a `ctypes` pad-probe with x86-64 offsets. We composite a marker back into plain screenshots so the pointer stays visible.
- User-takeover guard: `input.guard_user` compares the live pointer to where WE last commanded it; more than `MCP_SCREEN_GUARD_PX` px of drift ⇒ the user took the mouse ⇒ STOP. Pass `force=true` to bypass / take control back. Fails open if the cursor can't be read.
- Unicode typing: the portal keysym path drops non-ASCII; `input.type_text` auto-pastes any non-ASCII string via `wl-copy` + Ctrl+V, with a `finally` restoring the prior clipboard (or `wl-copy --clear` if it couldn't be saved) so sensitive text never outlives the call. Falls back to ASCII-only keysyms if `wl-clipboard` is absent. xdotool / XTEST can NOT reach native-Wayland apps.
- Modifier+letter combos: `input.key` lowercases single-letter trailing parts when modifiers are present, so `"Ctrl+A"` is select-all, not Ctrl+Shift+a (capital-A is the X11 keysym for shifted A). Standalone `key("A")` keeps its case for legacy text-input behavior.
- GPU is hard-disabled (`CUDA_VISIBLE_DEVICES=""` at server top); grounding is CPU-only by design, for predictable latency and no driver flake.
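Two of the notes above, the modifier+letter rule and the user-takeover guard, reduce to a few lines each. The sketch below is an illustration only: the function names are hypothetical, the modifier set is a guess, Euclidean distance is an assumed metric, and the 40 px default simply mirrors the value shown for `MCP_SCREEN_GUARD_PX`.

```python
import math

MODIFIERS = {"ctrl", "alt", "shift", "super"}  # assumed set


def normalize_combo(combo: str) -> list[str]:
    """Lowercase a single trailing letter when a modifier precedes it."""
    parts = [p.strip() for p in combo.split("+")]
    has_modifier = any(p.lower() in MODIFIERS for p in parts[:-1])
    last = parts[-1]
    if has_modifier and len(last) == 1 and last.isalpha():
        parts[-1] = last.lower()
    return parts


def user_took_over(live, commanded, threshold_px: float = 40.0) -> bool:
    """True when the live pointer drifted beyond the threshold.

    Fails open: if either position is unknown, report no takeover.
    """
    if live is None or commanded is None:
        return False
    return math.dist(live, commanded) > threshold_px


print(normalize_combo("Ctrl+A"))                 # ['Ctrl', 'a']
print(normalize_combo("A"))                      # ['A']
print(user_took_over((100, 100), (140, 140)))    # True
```

The sketch does not handle a literal `+` key (for example `"Ctrl++"`), which a real parser would need to special-case.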
Install as a Claude Code plugin
screen-mcp ships as a Claude Code plugin that bundles the MCP server and a
drive-screen skill (the locate → ground → act → confirm loop).
/plugin marketplace add 88plug/screen-mcp
/plugin install screen-mcp@screen-mcp
One-time setup after install (the server has system + Python deps the manifest can't install for you):
# in the installed plugin dir (or a clone)
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
# system packages (Arch/Manjaro names; use your distro equivalents):
# gstreamer>=1.28, pipewire, python-gobject, xdg-desktop-portal-gnome, wl-clipboard
Requirements: Linux + Wayland + GNOME. First run pops an xdg-desktop-portal
RemoteDesktop + ScreenCast consent dialog (token cached at ~/.config/mcp-screen).
Optional: /dev/uinput (group input) for the kernel input backend, and the
bundled GNOME-Shell extension for full window awareness (one-time Wayland re-login).
The launcher (bin/screen-mcp) fails with a clear message if the deps are missing,
so a misconfigured install never silently half-works.
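A preflight check of this kind can be sketched as follows. The default module and binary names are illustrative assumptions (`gi` is the import name of python-gobject, `wl-copy` ships with wl-clipboard); the real launcher's checks and wording are not documented here.

```python
import importlib.util
import shutil


def missing_deps(modules, binaries) -> list[str]:
    """List the top-level Python modules and executables that cannot be found."""
    missing = [f"python module {m!r}" for m in modules
               if importlib.util.find_spec(m) is None]
    missing += [f"executable {b!r}" for b in binaries
                if shutil.which(b) is None]
    return missing


def preflight(modules=("gi",), binaries=("wl-copy",)) -> None:
    """Exit with one clear message instead of failing later, half-started."""
    missing = missing_deps(modules, binaries)
    if missing:
        raise SystemExit("screen-mcp cannot start, missing: " + ", ".join(missing))
```

Calling `preflight()` before any heavy import keeps the failure at the top of the log, where it is easy to spot.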
License
FSL-1.1-ALv2 © 2026 88plug — Functional Source License; converts to Apache 2.0 two years after each release.