Ubuntu Desktop Control MCP
Enables AI assistants to control Ubuntu desktops through screenshots, mouse clicks, and keyboard interactions using AT-SPI integration and computer vision. It features optimized element detection and workflow batching for fast and accurate visual interaction with desktop applications.
Ubuntu Desktop Control MCP Server
An MCP (Model Context Protocol) server that enables LLMs to control your Ubuntu desktop by taking screenshots and sending mouse clicks and keyboard input. This allows AI assistants to visually interact with your desktop applications.
NEW: Optimized Production Workflow
5x faster, 5x more accurate! Now using the same optimization techniques as Anthropic's Computer Use API:
- Smart Screenshots: Auto-downsampled to 1280x720 (5x smaller)
- Numbered Elements: See what's clickable at a glance with overlaid IDs
- AT-SPI Integration: Automatic UI element detection using the accessibility API
- Percentage Coords: Resolution-agnostic positioning (no more pixel hunting!)
- Workflow Batching: Execute multiple actions in one MCP call
- Element Cache: Direct element interaction - "click element #5"
Example - Old way (8+ calls, ~15s):
take_screenshot() → analyze → grid overlay → zoom quadrant → find pixel → click → miss
Example - New way (1 call, ~3s):
take_screenshot() → "I see Pinta is element #5" → click_screen(element_id=5) → ✓
See README.md for full details.
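To illustrate why percentage coordinates are resolution-agnostic, here is a minimal Python sketch. The helper name `percent_to_pixels` and the 0-100 percentage range are assumptions made for illustration; the server's actual conversion may differ.

```python
def percent_to_pixels(x_pct: float, y_pct: float, width: int, height: int) -> tuple[int, int]:
    """Map percentage coordinates (assumed 0-100) to pixel coordinates."""
    if not (0 <= x_pct <= 100 and 0 <= y_pct <= 100):
        raise ValueError("percentages must be between 0 and 100")
    # Clamp to the last valid pixel so 100% does not fall off the screen.
    x = min(int(width * x_pct / 100), width - 1)
    y = min(int(height * y_pct / 100), height - 1)
    return x, y


# The same percentage pair addresses the same relative spot at any resolution.
print(percent_to_pixels(50, 50, 1280, 720))   # (640, 360)
print(percent_to_pixels(50, 50, 3840, 2160))  # (1920, 1080)
```

Because the input is relative, a target located in a downsampled 1280x720 screenshot can be reused on the full-resolution desktop without recomputing pixel offsets.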
Features
- Screenshot Capture: Annotated screenshots with automatic element detection
- Element Detection: AT-SPI with a computer-vision (CV) fallback for robust UI element identification
- Smart Clicking: Click by element ID or percentage coordinates
- Keyboard Control: Type text and press keys/hotkeys
- Mouse Movement: Smooth cursor positioning with animation
- Workflow Batching: Execute multi-step tasks in a single MCP call
- Diagnostics: Display scaling detection, warnings, and recommendations
Quick Start
1. Prerequisites
- Ubuntu Linux (X11 required, Wayland not fully supported)
- Python 3.9+
2. Installation
From PyPI (Recommended)
pip install ubuntu-desktop-control
From Source
# Clone repository
git clone https://github.com/charettep/ubuntu-desktop-control-mcp.git
cd ubuntu-desktop-control-mcp
# Install system dependencies (requires sudo)
chmod +x scripts/install.sh
./scripts/install.sh
# Install Python dependencies
pip install -e .
Configuration
Claude Code
<details> <summary>Installation Methods</summary>
Method 1: CLI (Recommended)
claude mcp add --transport stdio ubuntu-desktop-control -- \
ubuntu-desktop-control
Method 2: Manual Config
Edit ~/.claude/claude_desktop_config.json:
{
  "mcpServers": {
    "ubuntu-desktop-control": {
      "command": "ubuntu-desktop-control",
      "args": []
    }
  }
}
</details>
VS Code Insiders
<details> <summary>Installation Methods</summary>
Method 1: MCP Command
- Open the Command Palette (`Ctrl+Shift+P`)
- Run `MCP: Open Workspace Folder Configuration`
- Add the server configuration below.
Method 2: Manual Config
Create .vscode/mcp.json in your workspace:
{
  "servers": {
    "ubuntu-desktop-control": {
      "type": "stdio",
      "command": "ubuntu-desktop-control",
      "args": []
    }
  }
}
</details>
Codex CLI
<details> <summary>Installation Methods</summary>
Method 1: CLI
codex mcp add ubuntu-desktop-control -- \
ubuntu-desktop-control
Method 2: Manual Config
Edit ~/.config/codex/config.toml:
[mcp_servers.ubuntu-desktop-control]
type = "stdio"
command = "ubuntu-desktop-control"
args = []
</details>
Tools
Core Capabilities
| Tool | Description |
|---|---|
| `take_screenshot` | Capture the desktop (optionally per-monitor) with annotated elements. |
| `click_screen` | Click by element ID or percentage coordinates (supports per-monitor). |
| `move_mouse` | Move the cursor by element ID or percentage coordinates (supports per-monitor). |
| `drag_mouse` | Drag the cursor to coordinates while holding a mouse button. |
| `type_text` | Type text using the keyboard. |
| `press_key` | Press a specific key (e.g., 'enter', 'esc'). |
| `press_hotkey` | Press a combination of keys simultaneously (e.g., Ctrl+Shift+C). |
| `get_screen_info` | Get screen dimensions and display server type (X11/Wayland). |
| `get_display_diagnostics` | Troubleshoot scaling and coordinate mismatches. |
| `map_GUI_elements_location` | Detect and map UI elements (hitboxes) using Computer Vision. |
| `convert_screenshot_coordinates` | Convert pixels from a screenshot to logical click coordinates. |
| `list_prompt_templates` | List available prompt templates (for clients without native prompt support). |
| `execute_workflow` | Execute a batch of actions (screenshot/click/move/type/wait). |
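MCP clients invoke these tools by sending a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for `click_screen` with the `element_id=5` argument from the example earlier in this README. The helper `build_tool_call` is hypothetical and only shows the wire format; in practice your MCP client or SDK builds and sends this for you.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# `element_id` is the argument used in the click_screen example in this README.
print(build_tool_call(1, "click_screen", {"element_id": 5}))
```

Argument names for the other tools are not listed here; check each tool's input schema (as reported by your client) before calling it.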
Prompt Rendering Tools
These tools allow clients without native prompt support (like Codex CLI) to render prompt templates as text.
| Tool | Description |
|---|---|
| `render_prompt_baseline_display_check` | Render the baseline display check prompt. |
| `render_prompt_capture_full_desktop` | Render the full desktop capture prompt. |
| `render_prompt_capture_region_for_task` | Render the region capture prompt. |
| `render_prompt_convert_screenshot_coordinates` | Render the coordinate conversion prompt. |
| `render_prompt_safe_click` | Render the safe click prompt. |
| `render_prompt_hover_and_capture` | Render the hover and capture prompt. |
| `render_prompt_coordinate_mismatch_recovery` | Render the mismatch recovery prompt. |
| `render_prompt_end_to_end_capture_and_act` | Render the end-to-end workflow prompt. |
Prompts
| Prompt | Description |
|---|---|
| `baseline_display_check` | Check display settings and scaling before starting tasks. |
| `capture_full_desktop` | Capture and summarize the full desktop state. |
| `capture_region_for_task` | Capture a specific region for detailed inspection. |
| `safe_click` | Perform a click with safety checks and scaling awareness. |
| `hover_and_capture` | Hover to reveal UI elements, then capture. |
| `coordinate_mismatch_recovery` | Diagnose and fix missed clicks. |
| `end_to_end_capture_and_act` | Plan and execute a full interaction loop. |
Configuration & Customization
Environment Variables
The server relies on standard Linux/X11 environment variables to locate and interact with the desktop session.
| Variable | Description | Default |
|---|---|---|
| `DISPLAY` | X11 display identifier. Required for the server to know which screen to control. | `:0` |
| `XDG_SESSION_TYPE` | Used to detect if running on X11 or Wayland. | `unknown` |
| `XAUTHORITY` | Path to the X11 authority file. Required if running from a different user context (e.g., sudo, docker) or over SSH. | `~/.Xauthority` |
| `UDC_FORCE_COORDS` | Force coordinate clicks (disable AT-SPI action clicks). | unset |
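As a rough illustration of how these defaults apply, the following Python sketch resolves the three X11-related variables from an environment mapping. The function name `resolve_display_env` is hypothetical and this is not the server's code. `UDC_FORCE_COORDS` is left out because the values it accepts are not documented here.

```python
import os
from typing import Mapping


def resolve_display_env(env: Mapping[str, str]) -> dict[str, str]:
    """Apply the documented defaults to an environment mapping."""
    return {
        "DISPLAY": env.get("DISPLAY", ":0"),
        "XDG_SESSION_TYPE": env.get("XDG_SESSION_TYPE", "unknown"),
        "XAUTHORITY": env.get("XAUTHORITY", os.path.expanduser("~/.Xauthority")),
    }


settings = resolve_display_env(os.environ)
if settings["XDG_SESSION_TYPE"] == "wayland":
    print("Wayland session detected; this server requires X11")
print(settings["DISPLAY"])
```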
Passing Environment Variables
You can customize these variables in your MCP client configuration.
Claude Desktop (claude_desktop_config.json)
{
  "mcpServers": {
    "ubuntu-desktop-control": {
      "command": "ubuntu-desktop-control",
      "args": [],
      "env": {
        "DISPLAY": ":0",
        "XAUTHORITY": "/home/user/.Xauthority"
      }
    }
  }
}
VS Code (.vscode/mcp.json)
{
  "servers": {
    "ubuntu-desktop-control": {
      "command": "ubuntu-desktop-control",
      "args": [],
      "env": {
        "DISPLAY": ":0"
      }
    }
  }
}
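If you prefer to script this change rather than edit the file by hand, a sketch along these lines could merge an `env` block into an already-loaded config. The helper `with_server_env` is hypothetical. The `key` parameter switches between the `mcpServers` and `servers` layouts shown above.

```python
import copy
import json


def with_server_env(config: dict, server: str, env: dict[str, str], key: str = "mcpServers") -> dict:
    """Return a copy of `config` with `env` merged into one server entry."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    entry = updated.setdefault(key, {}).setdefault(server, {})
    entry.setdefault("env", {}).update(env)
    return updated


config = {
    "mcpServers": {
        "ubuntu-desktop-control": {"command": "ubuntu-desktop-control", "args": []}
    }
}
print(json.dumps(with_server_env(config, "ubuntu-desktop-control", {"DISPLAY": ":0"}), indent=2))
```

Reading and writing the file itself is left out on purpose; back up your client config before overwriting it.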
Display Scaling & Coordinates
If clicks land in the wrong place, you likely have a HiDPI display scaling mismatch (e.g., logical 1920x1080 vs physical 3840x2160).
Solutions:
- Auto-scale: Use `click_screen(..., auto_scale=True)` to let the server handle it.
- Diagnostics: Run `get_display_diagnostics()` to see the scaling factor.
- Element IDs: Use `take_screenshot(detect_elements=True)` and click via `element_id` or percentage coordinates.
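The mismatch comes down to a ratio between the screenshot size and the logical desktop size. The sketch below shows that arithmetic for the 3840x2160 vs 1920x1080 case. The function `screenshot_to_logical` is a hypothetical illustration, not the implementation behind `convert_screenshot_coordinates` or `auto_scale`.

```python
def screenshot_to_logical(
    x: int,
    y: int,
    screenshot_size: tuple[int, int],
    logical_size: tuple[int, int],
) -> tuple[int, int]:
    """Scale a pixel position in a screenshot to logical click coordinates."""
    shot_w, shot_h = screenshot_size
    logical_w, logical_h = logical_size
    return round(x * logical_w / shot_w), round(y * logical_h / shot_h)


# A 3840x2160 physical screenshot on a 1920x1080 logical desktop (2x scaling).
print(screenshot_to_logical(3000, 1000, (3840, 2160), (1920, 1080)))  # (1500, 500)
```

Clicking at the unscaled (3000, 1000) on this display would land outside the 1920x1080 logical area, which is why unscaled clicks miss.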
Troubleshooting
<details> <summary><strong>Common Issues</strong></summary>
- "Screenshot failed": Ensure `gnome-screenshot` or `scrot` is installed (`sudo apt install gnome-screenshot`).
- "PyAutoGUI not installed": Ensure you are using the Python interpreter from `.venv`.
- Wayland Issues: This server requires X11. Check with `echo $XDG_SESSION_TYPE`. If it prints "wayland", switch to "GNOME on Xorg" at login.
- Permission Denied: Run `xhost +local:` if you have X11 permission issues.
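
A quick pre-flight check for the first issue can be scripted. This sketch looks for either screenshot command on `PATH` using the standard library's `shutil.which`. The helper name `find_screenshot_tool` and the preference order are assumptions; the server may probe for tools differently.

```python
import shutil
from typing import Callable, Optional


def find_screenshot_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first available screenshot command, or None."""
    for tool in ("gnome-screenshot", "scrot"):
        if which(tool):
            return tool
    return None


tool = find_screenshot_tool()
print(tool or "No screenshot tool found; try: sudo apt install gnome-screenshot")
```
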
</details>
Security
Warning: This server gives LLMs full control over your mouse and keyboard and full visibility of your screen.
- Only use with trusted clients.
- Be aware screenshots may capture sensitive data.
- Automated clicks can be destructive.
License
MIT License