computer-control-mcp-enhanced
Computer Control MCP (Enhanced)
Enhanced MCP server for full computer control: mouse, keyboard, screenshots, OCR, deep UI automation, semantic element actions, process management, filesystem watching, and accessibility-aware app launching. Built for AI agents that need to see, understand, and interact with desktop applications.
Enhanced fork of computer-control-mcp by AB498.
<div align="center" style="text-align:center;font-family: monospace; display: flex; align-items: center; justify-content: center; width: 100%; gap: 10px"> <a href="https://img.shields.io/badge/License-MIT-yellow.svg"><img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge&color=00CC00" alt="License: MIT"></a> <a href="https://pypi.org/project/computer-control-mcp-enhanced"><img src="https://img.shields.io/pypi/v/computer-control-mcp-enhanced?style=for-the-badge" alt="PyPi"></a> </div>
Quick Usage (MCP Setup Using uvx)
Running uvx computer-control-mcp-enhanced@latest for the first time will download Python dependencies (~70MB). Subsequent runs are instant.
{
"mcpServers": {
"computer-control": {
"command": "uvx",
"args": ["computer-control-mcp-enhanced@latest"]
}
}
}
Or install globally with pip:
pip install computer-control-mcp-enhanced
computer-control-mcp-enhanced
What's New (vs upstream)
This fork adds significant perception and automation capabilities for AI agents:
- UI Automation: Full Windows UI Automation (UIA) and Linux AT-SPI tree traversal with occlusion filtering, exposing interactive elements (buttons, text fields, menus) with absolute screen coordinates
- Deep UI Automation: Discover elements with stable refs, traverse the element tree (parent/child navigation), and perform semantic actions (toggle, select, invoke, set text/range values, expand/collapse, move/resize) plus native text manipulation (select by search/offset, cursor positioning, word/line queries, text bounding rectangles), with no coordinate math needed
- Combined perception (take_screenshot_full): Image + OCR + UI automation in a single call with parallel execution, selectable via include_image/include_ocr/include_ui flags
- Region capture: All screenshot/OCR/UI tools accept a region=[x, y, w, h] parameter to capture arbitrary screen rectangles instead of full screen or full window
- Coordinate verification (capture_region_around): Capture a small area around target coordinates with coordinate rulers showing real screen positions on edges, plus optional red circle marker. Agents read coordinates directly from rulers, so no scale factor math is needed, typically 1-2 steps instead of 5-6
- Wait & polling tools: wait_for_window (appear/disappear/active), wait_for_focused_element, wait_for_screen_change. These synchronize with application state instead of blind delays
- Filesystem watching: Persistent directory watchers with event queues, or one-shot waits for file changes. Monitor builds, downloads, or any filesystem activity
- Process & system management: kill_process, list_processes, get_system_info for full process lifecycle and system diagnostics
- Accessibility-aware app launching (launch_app): Launch apps with the right accessibility flags for maximum UI element exposure (e.g. --force-renderer-accessibility for Chromium, ACCESSIBILITY_ENABLED=1 for VS Code on Linux)
- Change detection: Pixel diff, OCR diff, and UI diff tools to detect what changed on screen between actions
- Screenshot optimization: Prescaling to agent-friendly sizes, WebP/JPEG compression, grayscale/BW modes for token savings
- Performance: Parallel OCR tiling for full-screen captures, app_filter to skip irrelevant window trees in UI automation (~18s to ~0.2s for targeted windows), role-based heuristics instead of expensive COM pattern queries
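The parallel OCR tiling mentioned in the list above can be pictured roughly as follows. This is a minimal sketch, not the server's implementation: the tile size, the overlap, the `fake_ocr` stand-in, and the exact-match deduplication are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_regions(width, height, tile=800, overlap=40):
    """Split a width x height screen into overlapping [x, y, w, h] tiles."""
    step = tile - overlap
    regions = []
    # Stop early enough that the last tile is never fully covered by the previous one.
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            regions.append([x, y, min(tile, width - x), min(tile, height - y)])
    return regions


def fake_ocr(region):
    """Hypothetical stand-in for a real OCR call: 'finds' one word at screen (1530, 20)."""
    x, y, w, h = region
    word_x, word_y = 1530, 20
    if x <= word_x < x + w and y <= word_y < y + h:
        return [("File", word_x - x, word_y - y)]  # tile-local coordinates
    return []


def ocr_tiled(regions, ocr_fn, max_workers=4):
    """Run ocr_fn on every tile in parallel; return hits in absolute coordinates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_tile = list(pool.map(ocr_fn, regions))
    hits = set()  # a set drops duplicates found in overlapping tiles
    for (x0, y0, _w, _h), tile_hits in zip(regions, per_tile):
        for text, local_x, local_y in tile_hits:
            hits.add((text, x0 + local_x, y0 + local_y))
    return sorted(hits, key=lambda hit: (hit[2], hit[1]))


regions = tile_regions(1920, 1080)
print(len(regions), ocr_tiled(regions, fake_ocr))
```

The overlap keeps words that straddle a tile boundary readable in at least one tile. A real OCR engine would report slightly different boxes per tile, so deduplication would need a tolerance rather than exact equality.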
Features
- Full mouse control (click, move, drag, button hold)
- Keyboard input (type text, press keys, key combinations, hold keys)
- Screenshots of full screen, specific windows, or arbitrary regions
- OCR text extraction with absolute screen coordinates
- UI automation element detection (buttons, fields, menus, etc.)
- Deep UI automation with stable element refs, tree traversal, and semantic actions
- Semantic element actions: toggle, select, invoke, focus, expand/collapse, set text, set range value, move/resize
- Native text manipulation: select by offset or search, read selections, cursor positioning, word/line queries, text bounding rectangles
- Table/grid data reading with headers and paging support
- Programmatic container scrolling (page/line/percent) with scroll position queries
- View switching, virtualized item realization, drag info, and hyperlink navigation
- Window management (list, activate, wait for appear/disappear/active, fuzzy/regex matching)
- Process management (list, kill) and system diagnostics (CPU, memory, disk, OS)
- Filesystem watching (persistent watchers with event queues, one-shot file change waits)
- Screen change detection (pixel, OCR, and UI automation diffs)
- Accessibility-aware app launching for better UI element exposure
- Coordinate verification with ruler overlays and visual markers for precise clicking
- Image optimization (format, quality, color mode, prescaling)
- GPU-accelerated window capture via WGC (Windows only)
- Clipboard operations
- Action batching via perform_actions
Available Tools
Mouse Control
| Tool | Description |
|---|---|
| click_screen(x, y) | Click at screen coordinates |
| move_mouse(x, y) | Move mouse to coordinates |
| get_mouse_position() | Get current mouse pointer coordinates |
| drag_mouse(from_x, from_y, to_x, to_y, duration) | Drag from one position to another |
| mouse_down(button) | Hold down a mouse button |
| mouse_up(button) | Release a mouse button |
Keyboard Control
| Tool | Description |
|---|---|
| type_text(text) | Type text at current cursor position |
| press_keys(keys) | Press keys (single, sequences, or combinations like [["ctrl", "c"]]) |
| key_down(key) | Hold down a key |
| key_up(key) | Release a key |
Screenshots & Perception
| Tool | Description |
|---|---|
| take_screenshot(...) | Capture screen/window/region as an image. Supports title_pattern, region, format/quality options |
| take_screenshot_with_ocr(...) | Screenshot + OCR text extraction with absolute coordinates |
| take_screenshot_with_ui_automation(...) | Get UI automation elements (buttons, fields, etc.) with coordinates |
| take_screenshot_full(...) | Combined image + OCR + UI automation in one call. Use include_image, include_ocr, include_ui flags to select layers |
| capture_region_around(x, y, radius, mark_center, ...) | Capture a small region around coordinates, optionally with a red circle marker for verification |
Screen Change Detection
| Tool | Description |
|---|---|
| check_screen_changed(...) | Pixel-level diff between current screen and last baseline |
| check_screen_changed_with_images(...) | Pixel diff with annotated diff images |
| check_screen_changed_full(...) | Combined pixel + OCR + UI diff. Use include_image_diff, include_ocr_diff, include_ui_diff flags |
| check_ocr_changed(...) | Text-level diff via OCR (added/removed/changed text) |
| check_ui_automation_changed(...) | UI element diff (added/removed/changed elements) |
| wait_for_screen_change(...) | Poll until screen changes or timeout |
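To make the idea of a pixel-level diff against a baseline concrete, here is a small sketch in plain Python. It is not the code behind check_screen_changed: the frame representation (nested lists of grayscale values), the `threshold` parameter, and the returned field names are assumptions chosen for illustration.

```python
def pixel_diff(before, after, threshold=16):
    """Compare two equally sized grayscale frames (lists of rows of 0-255 ints)."""
    if len(before) != len(after) or any(len(a) != len(b) for a, b in zip(before, after)):
        raise ValueError("frames must have the same dimensions")
    # Collect the coordinates of every pixel whose brightness moved past the threshold.
    changed = [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(before, after))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if abs(a - b) > threshold
    ]
    if not changed:
        return {"changed": False, "ratio": 0.0, "bbox": None}
    total = sum(len(row) for row in before)
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    # Bounding box of the changed area as [x, y, w, h], matching the region convention.
    bbox = [min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1]
    return {"changed": True, "ratio": len(changed) / total, "bbox": bbox}


baseline = [[0] * 4 for _ in range(4)]
current = [row[:] for row in baseline]
current[1][1] = 255
current[2][2] = 255
print(pixel_diff(baseline, current))
```

Returning the bounding box in the same [x, y, w, h] shape as a region would let an agent pass it straight into a follow-up capture of just the changed area.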
Text Interaction
| Tool | Description |
|---|---|
| find_text(text, ...) | Find text on screen via OCR with fuzzy matching |
| click_text(text, ...) | Find and click on text |
| fill_text_field(x, y, text, ...) | Click a field, clear it, and type new text |
Cursor & Position
| Tool | Description |
|---|---|
| get_mouse_position() | Current mouse pointer coordinates |
| get_cursor_position() | Text caret/cursor position (Windows only) |
| get_screen_size() | Screen resolution |
Window & App Management
| Tool | Description |
|---|---|
| list_windows() | List all open windows with titles and positions |
| activate_window(title_pattern, ...) | Bring a window to foreground (fuzzy or regex matching) |
| get_active_window() | Get the currently active/foreground window |
| close_window(title_pattern, ...) | Close a window |
| launch_app(command, ...) | Launch an app with accessibility flags enabled for better UI automation |
| wait_for_window(title_pattern, mode, ...) | Wait for a window to appear, disappear, or become active |
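The fuzzy or regex matching that title_pattern refers to could work along these lines. This is only a sketch using the standard library: the `use_regex` and `threshold` parameters and the scoring rule are assumptions, and the server's actual matching may differ.

```python
import re
from difflib import SequenceMatcher


def match_windows(pattern, titles, use_regex=False, threshold=0.6):
    """Return the titles that match pattern, best match first."""
    if use_regex:
        rx = re.compile(pattern, re.IGNORECASE)
        return [title for title in titles if rx.search(title)]
    needle = pattern.lower()
    scored = []
    for title in titles:
        hay = title.lower()
        # A plain substring hit counts as a perfect match; otherwise fall back
        # to a similarity ratio so small typos still find the window.
        score = 1.0 if needle in hay else SequenceMatcher(None, needle, hay).ratio()
        if score >= threshold:
            scored.append((score, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _score, title in scored]


titles = ["Untitled - Notepad", "Settings", "Google Chrome"]
print(match_windows("notepad", titles))
print(match_windows(r"^Google", titles, use_regex=True))
print(match_windows("Setings", titles))
```

A similarity threshold trades recall for safety: too low and activate_window-style calls could target the wrong window, too high and a one-letter typo finds nothing.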
Deep UI Automation — Discovery
| Tool | Description |
|---|---|
| find_ui_elements(title_pattern, ...) | Deep-search UI elements by name, role, or text content with paging support |
| get_focused_element(title_pattern, ...) | Get the currently focused accessible element |
| get_element_at_point(x, y) | Get the deepest UI element at screen coordinates |
| get_element_details(element_ref) | Get full details of an element (patterns, states, properties) |
| get_element_children(element_ref) | Get child elements of a container |
| get_element_parent(element_ref) | Get parent element |
Deep UI Automation — Semantic Actions
| Tool | Description |
|---|---|
| focus_element(element_ref) | Give keyboard focus to an element |
| invoke_element(element_ref) | Click/activate a button, link, or menu item |
| toggle_element(element_ref) | Toggle a checkbox, switch, or toggle button |
| select_element(element_ref) | Select a list item, tab, or radio button |
| expand_element(element_ref) | Expand a tree node, combo box, or menu |
| collapse_element(element_ref) | Collapse a tree node or combo box |
| set_element_text(element_ref, text) | Set text value of an input field |
| get_element_text(element_ref) | Read text value from an element |
| select_text_range(element_ref, start, end) | Select text by character offset range |
| select_text_by_search(element_ref, search_text) | Find and select a substring in a text element |
| get_text_selection(element_ref) | Read the currently selected text |
| get_text_caret_offset(element_ref) | Get cursor/caret position as character offset |
| set_text_caret_offset(element_ref, offset) | Move cursor to a specific character offset |
| get_text_at_offset(element_ref, offset, unit) | Get word, line, or paragraph at an offset |
| get_text_bounds(element_ref, start, end) | Get screen-space bounding rectangles for a text range |
| scroll_element_into_view(element_ref) | Scroll an element into the visible area |
| set_element_range_value(element_ref, value) | Set numeric value on a slider or range control |
| move_element_ui(element_ref, x, y) | Move an element (window) to a position |
| resize_element_ui(element_ref, width, height) | Resize an element (window) |
| set_element_extents(element_ref, x, y, width, height) | Move + resize in one call |
| wait_for_focused_element(...) | Wait until the focused element matches name/role filters |
Deep UI Automation — Tables, Scrolling & Advanced
| Tool | Description |
|---|---|
| get_table_data(element_ref, start_row, max_rows) | Read data from tables/grids with headers and paging |
| scroll_element_container(element_ref, direction, amount, unit) | Programmatically scroll a container (page/line/percent) |
| get_scroll_info(element_ref) | Get scroll position and scrollability info |
| get_element_views(element_ref) | Get available views (list/details/icons) |
| set_element_view(element_ref, view_id) | Switch to a different view |
| realize_element(element_ref) | Force-load a virtualized item in a large list |
| get_drag_info(element_ref) | Get drag pattern info (is grabbed, drop effects) |
| get_hyperlinks(element_ref) | Get hyperlinks in a text element with URIs and offsets |
| activate_hyperlink(element_ref, link_index) | Activate a hyperlink by index |
Process & System Management
| Tool | Description |
|---|---|
| list_processes() | List all running processes with PIDs and memory usage |
| kill_process(process_name, pid, force) | Kill/terminate a running process |
| get_system_info() | Get CPU, memory, disk, OS, and network information |
Filesystem Watching
| Tool | Description |
|---|---|
| start_file_watch(paths, ...) | Start a persistent filesystem watcher, returns a watch_id |
| get_file_watch_events(watch_id) | Read queued events from a persistent watcher |
| stop_file_watch(watch_id) | Stop a persistent watcher |
| wait_for_file_change(paths, timeout_ms) | One-shot wait for the next filesystem change |
Utilities
| Tool | Description |
|---|---|
| set_clipboard(text) | Set clipboard contents |
| get_clipboard() | Get clipboard contents |
| wait_milliseconds(ms) | Wait/sleep |
| perform_actions(actions) | Execute a batch of actions sequentially |
| get_monitors() | Get information about connected monitors |
| get_agent_guide() | Get the full agent skill guide: best practices, workflows, and troubleshooting |
Region Capture
All screenshot, OCR, UI automation, and change detection tools accept a region parameter:
# Capture just the top-left quadrant (of a 1920x1080 screen)
take_screenshot_full(region=[0, 0, 960, 540])
# OCR only within a specific area
take_screenshot_with_ocr(region=[100, 200, 400, 300])
# Activate a window, then capture a sub-region of it
take_screenshot_full(title_pattern="Notepad", region=[1300, 280, 200, 100])
# Detect changes only within a region
check_screen_changed(region=[500, 500, 400, 300])
This lets an agent start with a full-screen capture, identify the area of interest, then focus subsequent captures on just that region to save tokens.
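One way an agent might derive the region for the follow-up capture is to pad a bounding box it already has (for example from an OCR or UI automation result) and clamp it to the screen. The helper below is a hypothetical client-side utility, not part of the MCP server; its name and parameters are assumptions.

```python
def padded_region(box, padding, screen_size):
    """Grow box = [x, y, w, h] by `padding` pixels per side, clamped to the screen."""
    x, y, w, h = box
    screen_w, screen_h = screen_size
    left = max(0, x - padding)
    top = max(0, y - padding)
    right = min(screen_w, x + w + padding)
    bottom = min(screen_h, y + h + padding)
    return [left, top, right - left, bottom - top]


# A box well inside the screen simply grows by the padding on every side.
print(padded_region([1300, 280, 200, 100], 50, (1920, 1080)))
# A box near the top-left corner is clamped at 0 instead of going negative.
print(padded_region([10, 10, 100, 40], 50, (1920, 1080)))
```

The resulting list has the same [x, y, w, h] shape as the region parameter shown above, so it could be passed along unchanged.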
Coordinate Verification Workflow
AI agents often misjudge coordinates from prescaled screenshots. The capture_region_around tool solves this with coordinate rulers — real screen coordinates drawn directly on the image edges:
# Capture a region with coordinate rulers (enabled by default)
capture_region_around(x=935, y=630, radius=80, ruler_tick_interval=25)
# → Zoomed-in view with X coordinates on top edge, Y coordinates on left edge
# → Agent reads target's exact position from the rulers — no math needed
# → Click the coordinates directly
How it works:
- Agent takes a full screenshot and roughly estimates the target area
- Agent calls capture_region_around; rulers on the edges show real screen coordinates
- Agent reads the target's exact coordinates from the ruler grid lines
- Agent clicks, typically accurately on the first attempt
Optional: add mark_center=True to also draw a red circle marker at the estimated position for visual confirmation before clicking.
This replaces the old multi-iteration guessing loop with a 1-2 step process.
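To illustrate what the rulers save the agent from computing, the sketch below works out which screen coordinates a ruler could label for a given capture, and the mapping from a pixel in the captured image back to the screen. It assumes ticks sit on multiples of the tick interval and introduces a hypothetical `zoom` factor; neither detail is confirmed by the tool's documentation.

```python
def ruler_ticks(center, radius, interval=25):
    """Screen coordinates on a multiple of `interval` inside [center - radius, center + radius]."""
    lo, hi = center - radius, center + radius
    first = -(-lo // interval) * interval  # round lo up to the next multiple of interval
    return list(range(first, hi + 1, interval))


def image_to_screen(px, py, x, y, radius, zoom=1.0):
    """Map a pixel in the (possibly zoomed) capture back to screen coordinates."""
    return (x - radius + round(px / zoom), y - radius + round(py / zoom))


# Values taken from the capture_region_around(x=935, y=630, radius=80) example.
print(ruler_ticks(935, 80, 25))  # labels along the top edge (X)
print(ruler_ticks(630, 80, 25))  # labels along the left edge (Y)
print(image_to_screen(160, 160, 935, 630, 80, zoom=2.0))
```

With labelled ticks the agent can read a target's position directly off the image instead of performing the second calculation itself.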
Accessibility-Aware App Launching
The launch_app tool automatically applies the right accessibility flags per app family:
| App Family | Flag/Env Var | Platform |
|---|---|---|
| Chromium browsers | --force-renderer-accessibility | All |
| Electron apps | --force-renderer-accessibility | All |
| VS Code family | ACCESSIBILITY_ENABLED=1 | Linux |
| Qt/KDE apps | QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1 | Linux |
| GTK/GNOME apps | Session-level AT-SPI activation | Linux |
Example: Microsoft Edge exposes 90 UI elements normally vs 199 elements with the accessibility flag (2.2x increase).
# Launch Chrome with accessibility enabled
launch_app(command=["google-chrome", "https://example.com"])
# Preview what would happen without launching
launch_app(command=["code", "."], dry_run=True)
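The per-family lookup can be pictured as a small rule table consulted before launching. The sketch below is not how launch_app is implemented: the flags and environment variables come from the table above, while the executable-name lists, the `plan_launch` function, and the first-match rule are assumptions for illustration.

```python
import os

# Flags and env vars are taken from the table; the executable names are illustrative guesses.
RULES = [
    {"names": {"google-chrome", "chromium", "msedge"},
     "args": ["--force-renderer-accessibility"], "env": {}, "linux_only": False},
    {"names": {"code", "codium"},
     "args": [], "env": {"ACCESSIBILITY_ENABLED": "1"}, "linux_only": True},
]


def plan_launch(command, platform):
    """Return (argv, extra_env) for command without launching anything."""
    exe = os.path.basename(command[0]).lower()
    if exe.endswith(".exe"):
        exe = exe[:-4]
    for rule in RULES:
        if exe in rule["names"] and (not rule["linux_only"] or platform == "linux"):
            # Insert the accessibility flags right after the executable.
            argv = [command[0], *rule["args"], *command[1:]]
            return argv, dict(rule["env"])
    return list(command), {}


print(plan_launch(["google-chrome", "https://example.com"], "linux"))
print(plan_launch(["code", "."], "linux"))
print(plan_launch(["code", "."], "windows"))
```

Returning a plan instead of starting the process mirrors the spirit of a dry run: the caller can inspect the final command line and environment first.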
Deep UI Automation
Go beyond coordinate-based clicking — interact with UI elements semantically using stable element refs.
Discovery: Find elements by name, role, or text content across any window. Results include stable refs that can be passed directly to action tools.
# Find push buttons labelled "Night light" in Windows Settings
find_ui_elements(title_pattern="Settings", role_filter="push button", text_filter="Night light")
# Get the element under specific coordinates
get_element_at_point(x=500, y=300)
# Navigate the element tree
get_element_children(element_ref={...})
get_element_parent(element_ref={...})
Semantic Actions: Toggle switches, select tabs, fill text fields, adjust sliders — all without calculating coordinates.
# Toggle a switch
toggle_element(element_ref={...})
# Set a text field directly
set_element_text(element_ref={...}, text="Hello World")
# Adjust a slider
set_element_range_value(element_ref={...}, value=75)
# Select a tab or list item
select_element(element_ref={...})
# Move/resize a window via UI automation
set_element_extents(element_ref={...}, x=100, y=100, width=800, height=600)
Text Manipulation: Select, navigate, and query text within document/edit elements — no keyboard simulation needed.
# Find and select a substring
select_text_by_search(element_ref={...}, search_text="brown fox")
# Select by character offsets
select_text_range(element_ref={...}, start=4, end=19)
# Read what's currently selected
get_text_selection(element_ref={...}) # → {"selections": [{"text": "brown fox"}]}
# Get/set cursor position
get_text_caret_offset(element_ref={...}) # → {"offset": 10}
set_text_caret_offset(element_ref={...}, offset=50)
# Get word or line at a position
get_text_at_offset(element_ref={...}, offset=4, unit="word") # → {"text": "quick "}
# Get screen coordinates for a text range
get_text_bounds(element_ref={...}, start=4, end=19) # → bounding rectangles
Element refs survive minor UI changes (scrolling, focus shifts) but need re-discovery if the window title changes or the element tree restructures.
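The offsets in the text manipulation examples above are consistent with zero-based, end-exclusive character positions, which is an inference from the sample outputs and not a documented guarantee. Under that assumption, plain Python string operations reproduce the same numbers; the sample sentence and helper names below are made up for illustration.

```python
text = "The quick brown fox jumps over the lazy dog"


def select_range(text, start, end):
    """Characters from start (inclusive) to end (exclusive)."""
    return text[start:end]


def find_selection(text, search_text):
    """Offsets of the first occurrence of search_text, or None if absent."""
    start = text.find(search_text)
    if start == -1:
        return None
    return {"start": start, "end": start + len(search_text), "text": search_text}


def word_at(text, offset):
    """The word containing offset, including its trailing whitespace."""
    start = offset
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = offset
    while end < len(text) and not text[end].isspace():
        end += 1
    while end < len(text) and text[end].isspace():
        end += 1
    return text[start:end]


print(select_range(text, 4, 19))
print(find_selection(text, "brown fox"))
print(repr(word_at(text, 4)))
```

Native text providers may count in other units (for example UTF-16 code units), so offsets for text containing emoji or combining characters could differ from Python's character indices.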
Built-in Agent Guide
The MCP includes a built-in skill guide that any AI agent can retrieve at runtime:
get_agent_guide()
# → {"guide_path": "/path/to/AGENT_GUIDE.md", "message": "Read the file at guide_path..."}
The call returns the file path to a comprehensive Markdown guide (~50KB). The agent reads the file at that path to learn best practices, tool selection, workflows, and troubleshooting for all 85 tools. No external documentation or agent skill files are needed, because the guide ships with the pip package.
Filesystem Watching
Monitor directories for file changes — useful for watching build output, downloads, or log files.
# Persistent watcher
start_file_watch(paths="C:/project/dist")
# ... trigger a build ...
get_file_watch_events(watch_id="...") # → created, modified, deleted events
stop_file_watch(watch_id="...")
# One-shot wait
wait_for_file_change(paths="C:/Users/me/Downloads", timeout_ms=30000)
Requires the watchdog Python library (pip install watchdog).
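For intuition about what created, modified, and deleted events mean, here is a standard-library sketch that compares directory snapshots and polls for the first change. It deliberately does not use watchdog and is not the server's implementation; the function names, the polling approach, and the event tuple shape are assumptions.

```python
import os
import tempfile
import time


def snapshot(root):
    """Map each file path under root to its (mtime_ns, size)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(before, after):
    """List (event, path) tuples describing how `after` differs from `before`."""
    events = [("created", p) for p in sorted(after.keys() - before.keys())]
    events += [("deleted", p) for p in sorted(before.keys() - after.keys())]
    events += [("modified", p) for p in sorted(before.keys() & after.keys())
               if before[p] != after[p]]
    return events


def wait_for_change(root, timeout_ms=30000, interval_ms=200):
    """Poll root until something changes; return the events, or [] on timeout."""
    baseline = snapshot(root)
    deadline = time.monotonic() + timeout_ms / 1000
    while time.monotonic() < deadline:
        time.sleep(interval_ms / 1000)
        events = diff_snapshots(baseline, snapshot(root))
        if events:
            return events
    return []


with tempfile.TemporaryDirectory() as root:
    before = snapshot(root)
    with open(os.path.join(root, "bundle.js"), "w") as fh:
        fh.write("console.log('built')")
    events = diff_snapshots(before, snapshot(root))
    idle = wait_for_change(root, timeout_ms=50, interval_ms=10)
print(events, idle)
```

Polling like this misses changes that happen and revert between two snapshots, which is one reason an event-based library such as watchdog is the better fit for a persistent watcher.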
Configuration
Custom Screenshot Directory
{
"mcpServers": {
"computer-control": {
"command": "uvx",
"args": ["computer-control-mcp-enhanced@latest"],
"env": {
"COMPUTER_CONTROL_MCP_SCREENSHOT_DIR": "C:\\Users\\YourName\\Pictures\\Screenshots"
}
}
}
}
Automatic WGC for Specific Windows
For GPU-accelerated windows that render black with standard capture:
{
"env": {
"COMPUTER_CONTROL_MCP_WGC_PATTERNS": "obs, discord, game, steam"
}
}
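How the comma-separated pattern list is interpreted is not spelled out here. A plausible reading, sketched below, is a case-insensitive substring match against the window title. Treat the matching rule and both function names as assumptions, not as the server's documented behaviour.

```python
import os


def parse_patterns(raw):
    """Split a comma-separated pattern string, dropping blanks and surrounding spaces."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def should_use_wgc(window_title, raw_patterns):
    """True if any pattern occurs in the window title (assumed: case-insensitive substring)."""
    title = window_title.lower()
    return any(pattern in title for pattern in parse_patterns(raw_patterns))


# Fall back to the example value from the config snippet when the variable is unset.
raw = os.environ.get("COMPUTER_CONTROL_MCP_WGC_PATTERNS", "obs, discord, game, steam")
print(parse_patterns(raw))
print(should_use_wgc("OBS Studio", "obs, discord, game, steam"))
print(should_use_wgc("Untitled - Notepad", "obs, discord, game, steam"))
```

Under this reading, a short pattern such as "game" would also match titles like "Game Bar", so narrower patterns are safer if standard capture works for most windows.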
Development
# Clone
git clone https://github.com/gzmagyari/computer-control-mcp.git
cd computer-control-mcp
# Install in dev mode (edits reflect immediately)
pip install -e .
# Run the server
computer-control-mcp-enhanced
# Run tests
python -m pytest
# Build
pip install hatch
hatch build
License
MIT
Based on computer-control-mcp by AB498.