screen-mcp

An MCP server that gives a model eyes and hands on a Linux Wayland desktop. It provides:

  • Screenshots via PipeWire.
  • Pointer and keyboard input via the RemoteDesktop portal.
  • OCR and icon detection via RapidOCR plus an OmniParser ONNX model.
  • An ambient sense layer that diffs frames, so the agent knows when something opened or nothing changed.
  • A write-through world-model cache, so a recognised screen skips OCR.
  • An opt-in ack gate that blocks close-combos and destructive-keyword clicks until the caller passes a confirmation token.

Current version: 1.3.2.

Requirements

  • Linux + Wayland + GNOME (the awareness layer uses a bundled GNOME Shell extension; AT-SPI is the fallback for GTK apps).
  • Python 3.10+ (tested on 3.14).
  • GStreamer >= 1.28 (uses leaky-type; the older drop= was removed in 1.28).
  • PipeWire + xdg-desktop-portal-gnome.
  • wl-clipboard (for the Unicode paste path in screen_type).
  • A DejaVu Sans Bold font (Set-of-Marks labels; falls back to PIL's default).

Install

System packages first — see requirements.txt for the full pacman / apt one-liners.

# Arch
sudo pacman -S python-gobject gobject-introspection \
               gstreamer gst-plugins-base gst-plugins-good gst-libav \
               pipewire pipewire-pulse xdg-desktop-portal-gnome \
               wl-clipboard ttf-dejavu

# Python deps
pip install -r requirements.txt

Install the GNOME Shell extension (optional but recommended — gives the awareness layer reliable focused-window + window-list data):

gnome-shell-extension/window-info@local/install.sh
# then enable it:
gnome-extensions enable window-info@local

Wire it into Claude Code

Add to ~/.claude.json under mcpServers:

{
  "mcp-screen": {
    "command": "python3",
    "args": ["/path/to/mcp-screen/server.py"]
  }
}
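
If you would rather script that edit than do it by hand, the merge can be sketched as below. This is only an illustration: `register_server` is a hypothetical helper, not part of screen-mcp, and it assumes the config file is plain JSON with a top-level `mcpServers` object, as the instruction above implies.

```python
import json
from pathlib import Path


def register_server(config_path: Path, server_py: str) -> dict:
    """Merge an "mcp-screen" entry into the mcpServers map of a JSON config.

    Hypothetical helper: existing keys and other servers are left untouched.
    """
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    servers = config.setdefault("mcpServers", {})
    servers["mcp-screen"] = {"command": "python3", "args": [server_py]}

    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the config file before running anything like this against a real `~/.claude.json`.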

The first run triggers an xdg-desktop-portal consent dialog (pick which monitor(s) to share). The portal returns a restore token which is persisted to ~/.config/mcp-screen/token — subsequent runs are silent.
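
The persist-and-reuse pattern described above might look roughly like this. It is a sketch, not the server's actual code: `load_token` and `save_token` are hypothetical names, and the owner-only file mode is an assumption about what a careful implementation would do. Only the path comes from this README.

```python
import os
from pathlib import Path
from typing import Optional

# Path taken from this README; everything else here is illustrative.
TOKEN_PATH = Path.home() / ".config" / "mcp-screen" / "token"


def load_token(path: Path = TOKEN_PATH) -> Optional[str]:
    """Return the saved restore token, or None if there is none yet."""
    try:
        token = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        return None
    return token or None


def save_token(token: str, path: Path = TOKEN_PATH) -> None:
    """Persist the token; the 0600 mode applies when the file is first created."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(token)
```

A `None` from `load_token` corresponds to the first-run case, where the consent dialog has to be shown; deleting the token file forces that dialog again.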

Tools

| Name | What it does |
| --- | --- |
| screen_screenshot | Capture the desktop. region=[x,y,w,h] or monitor=N to zoom. annotate=true overlays numbered Set-of-Marks + lists click coords. use_cache=true (with annotate) reuses learned elements for a known screen (skips OCR). fresh=true forces a current frame on a damage-driven static monitor (defeats the keepalive-resend stale read). It nudges the pointer, so it is used sparingly (auto only right after an unconfirmed action), not on every shot; pass it explicitly if a static-monitor read looks stale. |
| screen_list_monitors | Monitors (origin/size/scale), desktop bounds, focused windows. |
| screen_move_mouse | Move pointer to x,y (view-space default; server maps to real px). |
| screen_click | Click at x,y or in place. button: left\|right\|middle, double: true. |
| screen_scroll | Wheel scroll. direction: up\|down\|left\|right, amount: notches. |
| screen_drag | Press-drag from (x1,y1) to (x2,y2). |
| screen_key | Press a key/combo: "Ctrl+L", "Enter", "Alt+Tab", "F5". |
| screen_type | Type text (Unicode via wl-copy + Ctrl+V; ASCII via keysyms). enter: true presses Enter after. Keys go to the FOCUSED window, so pass focus: "app" or call screen_focus first. |
| screen_focus | Raise + give KEYBOARD focus to a window (app/title/id) so injected keys/clicks land in it. Uses the window-info extension's ActivateWindow when loaded, else the GNOME overview. |
| screen_do | Batched ordered actions in one call. |
| screen_tour | Visit several UI states and get a labeled thumbnail of each. |
| screen_read_page | Auto-scroll a scrollable view in one call; accumulates every interactable. |
| screen_wait | Block until the screen settles, then optionally screenshot. |
| screen_session | Recorder: op=start\|stop\|list\|status\|replay-path. |
| screen_reload | Hot-reload the server in place after edits (no /mcp reconnect). |
| screen_diag | Health dump: session/geo, cursor, grounding backends, world-model stats. |

Every action accepts these common parameters:

  • space: 'view' | 'desktop' | 'norm'. The default is view (coords as seen in the last screenshot).
  • shot: true returns a screenshot after the action.
  • verify: true warns on misclicks that cause no screen change.
  • force: true bypasses the user-takeover guard.
  • element: <id> clicks an element id returned by the last annotate=true shot (the server resolves exact coords; no guessing).
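
To make the three coordinate spaces concrete, here is one way the mapping to real pixels could work. Treat it as a sketch under stated assumptions: the `View` record, `downscaled_size`, and `to_desktop` are hypothetical names, and reading 'norm' as 0..1 fractions of the last screenshot's area is a guess, not something this README specifies. The 2576 long-edge figure is the value shown for MCP_SCREEN_MAX_EDGE.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class View:
    """Geometry of the last screenshot (all fields are hypothetical)."""
    origin_x: int  # desktop px of the shot's top-left corner
    origin_y: int
    real_w: int    # size of the captured area in desktop px
    real_h: int
    shot_w: int    # size of the image the model actually saw
    shot_h: int


def downscaled_size(w: int, h: int, max_edge: int = 2576) -> Tuple[int, int]:
    """Shrink (w, h) so the long edge is at most max_edge, keeping aspect."""
    long_edge = max(w, h)
    if long_edge <= max_edge:
        return w, h
    scale = max_edge / long_edge
    return max(1, round(w * scale)), max(1, round(h * scale))


def to_desktop(x: float, y: float, space: str, view: View) -> Tuple[int, int]:
    """Map a point given in `space` to desktop pixels."""
    if space == "desktop":
        return round(x), round(y)
    if space == "view":  # pixels in the (possibly downscaled) screenshot
        return (view.origin_x + round(x * view.real_w / view.shot_w),
                view.origin_y + round(y * view.real_h / view.shot_h))
    if space == "norm":  # assumed: 0..1 fractions of the captured area
        return (view.origin_x + round(x * view.real_w),
                view.origin_y + round(y * view.real_h))
    raise ValueError(f"unknown space: {space!r}")
```

The practical point is that a model can click on what it saw in a downscaled image without knowing the monitor layout; the server owns the conversion.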

Environment variables

| Variable | Effect |
| --- | --- |
| MCP_SCREEN_GUARD=1 | Enable the reliability ack gate. Destructive combos (Ctrl+W, Alt+F4, cmd+q), OCR-matched destructive keywords (delete/pay/submit/...), and out-of-allowlist actions block unless the caller passes ack=<reason>. |
| MCP_SCREEN_APPS="firefox,terminal" | With guard on, restrict actions to this allowlist of focused apps. |
| MCP_SCREEN_AUDIT_FRAMES=1 | Add pre/post frame hash + changed_bbox to every audit log line. ~100-500ms latency per action. |
| MCP_SCREEN_AMBIENT=0 | Disable the ambient SENSE hint block. |
| MCP_SCREEN_GUARD_PX=40 | Threshold for the user-takeover guard (live pointer vs last-commanded). |
| MCP_SCREEN_CPU_THREADS=6 | ONNX intra-op thread count for OmniParser. |
| MCP_SCREEN_MAX_EDGE=2576 | Screenshot downscale target (long edge). |
| MCP_SCREEN_NO_FRESH=1 | Disable forced fresh-frame capture on static monitors (screenshots may then return the keepalive-resent stale frame). |
| MCP_SCREEN_FOCUS_SETTLE_MS=150 | Delay after screen_focus activates a window (lets the compositor deliver keyboard focus before a following keystroke burst). |
| MCP_SCREEN_NO_NUDGE=1 | Disable the pointer damage-nudge used to prime/refresh a static monitor's frame. |
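
The ack-gate rules controlled by MCP_SCREEN_GUARD and MCP_SCREEN_APPS can be summarised in a few lines. The following is a minimal sketch, not the server's implementation: `blocked_reason` and the `action` dictionary keys (`combo`, `target_text`, `focused_app`, `ack`) are hypothetical, and the keyword tuple contains only the three words this README names, since the full list is not given.

```python
import os
from typing import Mapping, Optional

DESTRUCTIVE_COMBOS = {"ctrl+w", "alt+f4", "cmd+q"}
# Only the keywords named in the README; the real list is longer and unknown here.
DESTRUCTIVE_WORDS = ("delete", "pay", "submit")


def blocked_reason(action: dict, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return None if the action may run, else a short reason it is blocked."""
    env = os.environ if env is None else env
    if env.get("MCP_SCREEN_GUARD") != "1":
        return None  # the gate is opt-in
    if action.get("ack"):
        return None  # caller acknowledged with ack=<reason>

    combo = action.get("combo", "").replace(" ", "").lower()
    if combo in DESTRUCTIVE_COMBOS:
        return f"destructive combo: {combo}"

    text = action.get("target_text", "").lower()  # e.g. OCR text under the click
    for word in DESTRUCTIVE_WORDS:
        if word in text:
            return f"destructive keyword: {word}"

    allow = [a.strip().lower() for a in env.get("MCP_SCREEN_APPS", "").split(",") if a.strip()]
    if allow and action.get("focused_app", "").lower() not in allow:
        return "focused app is not in MCP_SCREEN_APPS"
    return None
```

For example, with the guard on, `blocked_reason({"combo": "Ctrl+W"}, {"MCP_SCREEN_GUARD": "1"})` returns a reason string, while the same call with `"ack": "closing a scratch tab"` added returns `None`.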

Data paths

| Path | What |
| --- | --- |
| ~/.config/mcp-screen/token | Portal restore token (one-time consent). |
| ~/.local/share/mcp-screen/world/map.db | World-model SQLite cache (per-screen learned elements). |
| ~/.local/share/mcp-screen/sessions/<sid>/ | Recorder trajectories + WebP frames + replay.html. |
| ~/.local/state/mcp-screen/actions.jsonl | Reliability audit log (one JSON line per action). |
| /tmp/screen_err.txt | Last unhandled tool traceback (dev-diagnostic only). |
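
Because the audit log is one JSON object per line, it is easy to inspect from a script. The reader below is a small sketch; `read_audit` is a hypothetical helper, and it deliberately assumes nothing about field names inside each entry, since this README does not list them.

```python
import json
from pathlib import Path
from typing import List

# Path taken from the table above.
AUDIT_LOG = Path.home() / ".local" / "state" / "mcp-screen" / "actions.jsonl"


def read_audit(path: Path = AUDIT_LOG, last: int = 20) -> List[dict]:
    """Return the last `last` parseable entries of a JSON-lines log."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # tolerate a partially written line
    return entries[-last:]
```

Skipping unparseable lines instead of raising keeps the reader usable while the server is still appending to the file.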

Dev workflow

pytest -q                   # 78 tests, ~0.7s, no live D-Bus needed (conftest stubs)

Edit a .py, then in the running Claude Code session:

screen_reload              # re-execs the server in place (preserves the MCP connection)

On any tool exception the dispatcher writes the full traceback to /tmp/screen_err.txt (the JSON-RPC error only carries the message); read it when debugging crashes.
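
The split between "message in the JSON-RPC error, full traceback on disk" can be sketched as follows. This illustrates the pattern only: `dispatch` is a hypothetical function, and the use of error code -32603 (the JSON-RPC 2.0 "Internal error" code) is an assumption about what the server returns.

```python
import traceback
from pathlib import Path
from typing import Any, Callable, Optional

ERR_PATH = Path("/tmp/screen_err.txt")  # path taken from this README


def dispatch(handler: Callable[..., Any], args: dict,
             err_path: Path = ERR_PATH, request_id: Optional[int] = None) -> dict:
    """Run a tool handler; on failure, save the traceback and return a short error."""
    try:
        return {"jsonrpc": "2.0", "id": request_id, "result": handler(**args)}
    except Exception as exc:
        # The full traceback goes to disk; the client only sees the message.
        err_path.write_text(traceback.format_exc(), encoding="utf-8")
        return {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": -32603, "message": str(exc)},
        }
```

Each failure overwrites the previous file, which matches the "last unhandled traceback" wording in the data-paths table.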

Ops notes (hard-won — read before touching capture/input)

  • Fractional scaling: NotifyPointerMotionAbsolute coords are logical and local to each stream (keyed by node_id). Don't add a global logical origin; the portal rejects such coords with "Invalid position". See input.global_to_logical.
  • Cursor position: cursor_mode=METADATA(4) means the cursor is NOT baked into frames. PipeWire attaches a SPA_META_Cursor to its src pad, but videoconvert strips it and PyGObject can't downcast it, so capture.py reads it via a ctypes pad-probe with x86-64 offsets. We composite a marker back into plain screenshots so the pointer stays visible.
  • User-takeover guard: input.guard_user compares the live pointer to where WE last commanded it; > MCP_SCREEN_GUARD_PX px drift ⇒ the user took the mouse ⇒ STOP. Pass force=true to bypass / take control back. Fails open if the cursor can't be read.
  • Unicode typing: the portal keysym path drops non-ASCII; input.type_text auto-pastes any non-ASCII string via wl-copy + Ctrl+V, with a finally restoring the prior clipboard (or wl-copy --clear if it couldn't be saved) so sensitive text never outlives the call. Falls back to ASCII-only keysyms if wl-clipboard is absent. xdotool / XTEST can NOT reach native-Wayland apps.
  • Modifier+letter combos: input.key lowercases single-letter trailing parts when modifiers are present, so "Ctrl+A" is select-all, not Ctrl+Shift+a (capital-A is the X11 keysym for shifted A). Standalone key("A") keeps its case for legacy text-input behavior.
  • GPU is hard-disabled (CUDA_VISIBLE_DEVICES="" at server top); grounding is CPU-only by design — predictable latency, no driver flake.
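
Two of the notes above, the user-takeover guard and the modifier+letter rule, reduce to small pure functions. The sketch below is illustrative only: `user_took_over` and `normalize_combo` are hypothetical names (the real entry points are input.guard_user and input.key), the modifier set is a guess, and using Euclidean distance for "drift" is an assumption.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
MODIFIERS = {"ctrl", "alt", "shift", "super"}  # assumed set of modifier names


def user_took_over(live: Optional[Point], commanded: Optional[Point],
                   threshold_px: float = 40.0) -> bool:
    """True if the live pointer drifted more than threshold_px from our last command."""
    if live is None or commanded is None:
        return False  # fail open when the cursor can't be read
    return math.dist(live, commanded) > threshold_px


def normalize_combo(combo: str) -> List[str]:
    """Split "Ctrl+A" into parts, lowercasing a single trailing letter after a modifier."""
    parts = [p.strip() for p in combo.split("+") if p.strip()]
    if not parts:
        return parts
    has_modifier = any(p.lower() in MODIFIERS for p in parts[:-1])
    last = parts[-1]
    if has_modifier and len(last) == 1 and last.isalpha():
        parts[-1] = last.lower()  # "A" would otherwise mean shifted A
    return parts
```

With these definitions, `normalize_combo("Ctrl+A")` yields `["Ctrl", "a"]` while `normalize_combo("A")` keeps `["A"]`, mirroring the behavior described for key("A").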

Install as a Claude Code plugin

screen-mcp ships as a Claude Code plugin that bundles the MCP server and a drive-screen skill (the locate → ground → act → confirm loop).

/plugin marketplace add 88plug/screen-mcp
/plugin install screen-mcp@screen-mcp

One-time setup after install (the server has system + Python deps the manifest can't install for you):

# in the installed plugin dir (or a clone)
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
# system packages (Arch/Manjaro names; use your distro equivalents):
#   gstreamer>=1.28, pipewire, python-gobject, xdg-desktop-portal-gnome, wl-clipboard

Requirements: Linux + Wayland + GNOME. First run pops an xdg-desktop-portal RemoteDesktop + ScreenCast consent dialog (token cached at ~/.config/mcp-screen). Optional: /dev/uinput (group input) for the kernel input backend, and the bundled GNOME-Shell extension for full window awareness (one-time Wayland re-login).

The launcher (bin/screen-mcp) fails with a clear message if the deps are missing, so a misconfigured install never silently half-works.
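
A fail-fast dependency check of that kind could look like the sketch below. It is not the contents of bin/screen-mcp: `missing_deps` and `main` are hypothetical, and the default module and executable names are only examples drawn from the requirements listed earlier (`gi` is the import name of python-gobject, `wl-copy` ships with wl-clipboard).

```python
import importlib.util
import shutil
import sys
from typing import List, Sequence


def missing_deps(modules: Sequence[str] = ("gi",),
                 binaries: Sequence[str] = ("wl-copy",)) -> List[str]:
    """List top-level Python modules and executables that can't be found."""
    missing = [f"python module '{m}'" for m in modules
               if importlib.util.find_spec(m) is None]
    missing += [f"executable '{b}'" for b in binaries
                if shutil.which(b) is None]
    return missing


def main() -> int:
    """Return a non-zero exit code with a readable message if anything is missing."""
    problems = missing_deps()
    if problems:
        print("screen-mcp cannot start; missing: " + ", ".join(problems),
              file=sys.stderr)
        return 1
    return 0
```

A real launcher would end with `raise SystemExit(main())`, so the process exits before the MCP handshake instead of failing later inside a tool call.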

License

FSL-1.1-ALv2 © 2026 88plug — Functional Source License; converts to Apache 2.0 two years after each release.
