ControlMCP
An MCP server that enables LLMs to see and control a computer (screen capture, window management, mouse and keyboard automation), with a structured plan-execute workflow for complex desktop automation.
You're already a mature LLM, so you should learn to operate the computer by yourself.

MCP server for LLM-controlled computer operations: screen capture, window management, mouse & keyboard automation.
Overview
ControlMCP is a Model Context Protocol (MCP) server that gives LLMs the ability to see and control a computer: take screenshots, manage windows, move/click the mouse, type on the keyboard, and chain all of these into complex automation workflows.
The repository also ships with a reusable agent skill at skills/computer-control/. It packages desktop-operation SOPs, shortcut guidance, JetBrains IDE workflows, and screenshot-to-click coordinate rules for agents that support skills.
Quick Start
Installation
Install from source:
git clone https://github.com/nix18/ControlMCP.git
cd ControlMCP
pip install -e .
Launch
control-mcp
The server communicates over stdio (standard MCP transport). Configure your MCP client to connect to the control-mcp command.
MCP Client Configuration
Add to your MCP client config (e.g. Claude Desktop, Cursor, etc.):
{
  "mcpServers": {
    "control-mcp": {
      "command": "control-mcp",
      "args": []
    }
  }
}
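If you would rather script this step than edit the file by hand, the merge can be sketched in Python. The helper name `add_control_mcp` is hypothetical, and the location of the config file depends on your client, so this sketch only works on an in-memory dictionary.

```python
import json


def add_control_mcp(config: dict) -> dict:
    """Return a copy of an MCP client config with the control-mcp entry added.

    Hypothetical helper: servers already present in the config are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["control-mcp"] = {"command": "control-mcp", "args": []}
    merged["mcpServers"] = servers
    return merged


# Example input: a config that already declares another (made-up) server.
existing = {"mcpServers": {"other-server": {"command": "other", "args": []}}}
updated = add_control_mcp(existing)
print(json.dumps(updated, indent=2))
```

Working on a copy keeps the original configuration untouched until you decide to write the result back.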
Tools (34 total)
Control Plane
| Tool | Description |
|---|---|
| `plan_desktop_task` | Convert a vague desktop instruction into a structured plan |
| `execute_desktop_plan` | Run a structured plan through the guarded executor |
| `get_execution_status` | Query the current status of a high-level execution run |
| `confirm_sensitive_action` | Explicitly approve or reject a sensitive action |
| `recover_execution_context` | Rebuild context after shortcut misuse or UI drift |
| `record_workflow_experience` | Persist reusable workflow experience |
Screen Capture
| Tool | Description |
|---|---|
| `capture_screen` | Full screen or monitor screenshot |
| `capture_region` | Region screenshot (x, y, width, height) |
| `capture_scroll_region` | Stitch a long screenshot while scrolling inside a fixed region |
| `get_screen_info` | List all monitors with resolution |
| `read_screenshot_base64` | Read a screenshot file as Base64 text |
| `resolve_grid_target` | Convert a grid cell + anchor into precise screen coordinates |
| `click_grid_target` | Resolve screenshot grid metadata and click directly |
Window Management
| Tool | Description |
|---|---|
| `list_windows` | List all visible windows |
| `find_windows` | Find windows by title substring |
| `focus_window` | Bring a window to the foreground |
| `capture_window` | Focus + screenshot a specific window |
Mouse Control
| Tool | Description |
|---|---|
| `mouse_click` | Click at coordinates (single/double/multi/hold) |
| `mouse_drag` | Drag from point A to point B |
| `mouse_move` | Move cursor without clicking |
| `mouse_position` | Get current cursor position |
| `mouse_scroll` | Scroll wheel up/down |
Keyboard Control
| Tool | Description |
|---|---|
| `key_press` | Press keys or hotkey combinations |
| `key_hold` | Hold keys for a duration |
| `key_type` | Type text character by character |
| `key_sequence` | Execute a timed sequence of key actions |
Combined Operations
| Tool | Description |
|---|---|
| `mouse_and_keyboard` | Execute a mixed sequence of mouse + keyboard + wait + screenshot actions |
Additional Actions
| Tool | Description |
|---|---|
| `clipboard_get` | Get clipboard text |
| `clipboard_set` | Set clipboard text |
| `launch_app` | Launch an application |
| `launch_url` | Open a URL in the browser |
| `wait` | Pause for N seconds |
| `get_pixel_color` | Get RGB color at screen coordinates |
| `hotkey` | Press a keyboard shortcut |
Examples
See docs/TUTORIAL.md for comprehensive usage examples.
// Plan a vague desktop task first
{"tool": "plan_desktop_task", "args": {"instruction": "Switch to PyCharm and run the current config"}}
// Execute a generated plan
{"tool": "execute_desktop_plan", "args": {"plan_id": "plan_abc123"}}
// Take a screenshot
{"tool": "capture_screen", "args": {}}
// Take a sharper screenshot when text clarity matters
{"tool": "capture_window", "args": {"title": "PyCharm", "quality": 75, "sharpen": true}}
// Read that screenshot as Base64 text for non-multimodal models
{"tool": "read_screenshot_base64", "args": {"file_path": "/tmp/screen.jpg"}}
// Click at (500, 300)
{"tool": "mouse_click", "args": {"x": 500, "y": 300}}
// Combined: click → select all → type
{"tool": "mouse_and_keyboard", "args": {"actions": [
  {"action": "click", "x": 500, "y": 300},
  {"action": "key_press", "keys": ["ctrl", "a"]},
  {"action": "key_type", "text": "New text"}
]}}
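The calls above are plain JSON objects, so they can be assembled programmatically. The following is a minimal sketch: the `tool_call` helper is hypothetical, and the `{"tool": ..., "args": ...}` envelope simply mirrors the notation used in these examples rather than the wire format your MCP client sends.

```python
import json


def tool_call(tool: str, **args) -> dict:
    """Build a call in the same {"tool", "args"} shape as the examples above."""
    return {"tool": tool, "args": args}


# Combined sequence: click a field, select its contents, then type over them.
replace_text = tool_call(
    "mouse_and_keyboard",
    actions=[
        {"action": "click", "x": 500, "y": 300},
        {"action": "key_press", "keys": ["ctrl", "a"]},
        {"action": "key_type", "text": "New text"},
    ],
)

print(json.dumps(replace_text, indent=2))
```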
Rebuilt Workflow
ControlMCP now supports a control-plane-first workflow for higher precision desktop automation:
- Normalize the user instruction with `plan_desktop_task`
- Review or directly execute the structured plan
- Let the guarded executor choose a faster observation strategy (`capture_window` / `capture_region` / `capture_scroll_region`)
- Verify each critical step and recover when context is lost
- Require explicit confirmation for payment/password/asset-related actions
- Save successful workflow experience for future runs
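As an illustration of the confirmation step, the sketch below flags plan steps that mention payment, password, or asset-related terms. It is a simplified assumption about how such gating could work, not ControlMCP's actual executor: the keyword list, the `needs_confirmation` function, and the plan-step dictionaries are all hypothetical.

```python
# Assumed keyword list; a real gate would likely be more thorough.
SENSITIVE_KEYWORDS = ("payment", "pay", "password", "transfer", "asset")


def needs_confirmation(step: dict) -> bool:
    """Return True when a step description contains a sensitive keyword."""
    words = step.get("description", "").lower().split()
    return any(word.strip(".,:;!?") in SENSITIVE_KEYWORDS for word in words)


# Hypothetical plan steps, used only to exercise the check.
plan = [
    {"description": "Focus the browser window"},
    {"description": "Type the account password"},
    {"description": "Click the Pay button"},
]

for step in plan:
    label = "CONFIRM" if needs_confirmation(step) else "auto"
    print(f"[{label}] {step['description']}")
```

Matching whole words rather than substrings avoids false positives such as "repay" or "passwordless", at the cost of missing inflected forms.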
For small or visually ambiguous targets, you can also ask capture_screen, capture_region,
or capture_window to generate a second grid_file_path overlay image with grid_rows and
grid_cols, then convert a chosen cell + anchor through resolve_grid_target before clicking.
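The grid-to-coordinate step is simple arithmetic. The sketch below assumes a uniform grid laid over a captured region, zero-based row and column indices, and a small set of anchor names. The real parameters and anchor vocabulary of `resolve_grid_target` may differ, so treat the function and its arguments as hypothetical.

```python
# Fractional offsets inside a cell for each (assumed) anchor name.
ANCHORS = {
    "center": (0.5, 0.5),
    "top_left": (0.0, 0.0),
    "top_right": (1.0, 0.0),
    "bottom_left": (0.0, 1.0),
    "bottom_right": (1.0, 1.0),
}


def grid_cell_to_screen(region, grid_rows, grid_cols, row, col, anchor="center"):
    """Map a zero-based grid cell and anchor to absolute screen coordinates.

    region is (x, y, width, height) of the captured area in screen pixels.
    """
    x, y, width, height = region
    if not (0 <= row < grid_rows and 0 <= col < grid_cols):
        raise ValueError("cell is outside the grid")
    cell_w = width / grid_cols
    cell_h = height / grid_rows
    fx, fy = ANCHORS[anchor]
    return round(x + (col + fx) * cell_w), round(y + (row + fy) * cell_h)


# A 4x4 grid over an 800x600 region whose top-left corner is at (100, 50).
print(grid_cell_to_screen((100, 50, 800, 600), 4, 4, row=1, col=2))  # (600, 275)
```

Each cell here is 200x150 pixels, so the center of row 1, column 2 sits 2.5 cells to the right of and 1.5 cells below the region origin.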
Documentation
| Document | Description |
|---|---|
| README.md | This file |
| README.zh-CN.md | Chinese version of this file |
| docs/REQUIREMENTS.md | Requirements analysis |
| docs/ARCHITECTURE.md | Architecture design |
| docs/MODULE_DESIGN.md | Module design |
| docs/FUNCTIONAL_DESIGN.md | Functional design |
| docs/TUTORIAL.md | Tutorial & examples |
| skills/computer-control/ | Agent Skill: computer operation SOPs |
| skills/computer-control/README.md | Skill-specific install and usage guide |
| skills/computer-control/docs/window-management.md | Window rescue and window shortcut reference |
| skills/computer-control/docs/idea-run-workflow.md | JetBrains IDE run/log observation workflow |
Agent Skill
The skills/computer-control/ folder contains a ready-to-use Agent Skill that teaches LLMs how to operate computers proficiently.
What is included
- `SKILL.md`: the main skill instructions, SOPs, shortcut tables, and common failure patterns
- `docs/coordinate-system.md`: coordinate conversion reference for screenshot-to-click workflows
- `docs/window-management.md`: window maximize/restore/snap shortcuts and window recovery workflow
- `docs/idea-run-workflow.md`: JetBrains IDE startup, run-panel switching, and log stabilization workflow
- `README.md`: skill-local installation and usage notes
What the skill covers
- Keyboard-first automation: prefer shortcuts over UI clicking whenever possible
- Plan-before-act control plane: normalize ambiguous instructions before touching the desktop
- Window recovery: fix minimized, half-screen, or partially restored windows before further actions
- Coordinate-safe clicking: convert screenshot-local coordinates into screen coordinates explicitly
- IDE workflows: IntelliJ IDEA / PyCharm run-configuration selection, run-panel switching, and log monitoring
- Sensitive-action gating: require confirmation before payment/password/asset-related steps
- Operational fallback: when JetBrains shortcuts do not behave as expected, check the local `ReferenceCard.pdf` or the official JetBrains documentation
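The coordinate-safe clicking rule amounts to undoing any resize applied to the screenshot and then adding the capture origin. Here is a minimal sketch, assuming the screen-space origin and size of the capture are known. The function name and arguments are hypothetical, and the skill's `docs/coordinate-system.md` remains the reference.

```python
def image_to_screen(px, py, image_size, region):
    """Convert a pixel position in a (possibly resized) screenshot to screen coordinates.

    image_size is (width, height) of the screenshot image.
    region is (x, y, width, height) of the captured area on screen.
    """
    img_w, img_h = image_size
    x, y, width, height = region
    scale_x = width / img_w   # screen pixels per image pixel, horizontally
    scale_y = height / img_h  # screen pixels per image pixel, vertically
    return round(x + px * scale_x), round(y + py * scale_y)


# A 1600x900 window at (200, 100), captured and downscaled to 800x450:
print(image_to_screen(400, 225, (800, 450), (200, 100, 1600, 900)))  # (1000, 550)
```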
Install the skill into your agent
You can either copy skills/computer-control/ into your agent's skill directory, or add it via a symbolic link.
Option 1: copy the directory
# Codex CLI
cp -r skills/computer-control ~/.codex/skills/
# Claude Code
cp -r skills/computer-control ~/.claude/skills/
# OpenCode
cp -r skills/computer-control ~/.config/opencode/skills/
Option 2: create a symbolic link
On macOS / Linux:
# Codex CLI
ln -s "$(pwd)/skills/computer-control" ~/.codex/skills/computer-control
# Claude Code
ln -s "$(pwd)/skills/computer-control" ~/.claude/skills/computer-control
# OpenCode
ln -s "$(pwd)/skills/computer-control" ~/.config/opencode/skills/computer-control
On Windows (Command Prompt as Administrator when required):
mklink /D "%USERPROFILE%\.codex\skills\computer-control" "%CD%\skills\computer-control"
mklink /D "%USERPROFILE%\.claude\skills\computer-control" "%CD%\skills\computer-control"
mklink /D "%USERPROFILE%\.config\opencode\skills\computer-control" "%CD%\skills\computer-control"
Using a symbolic link is convenient while iterating on the skill, because changes in this repository are reflected immediately in the agent's skills directory.
If your agent supports custom skill paths, you can also reference this folder directly.
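The same install can be scripted in a cross-platform way. The sketch below tries a symbolic link first and falls back to copying when links are not permitted (for example, on Windows without the required privileges). The `install_skill` function is hypothetical; the destination directories are the ones listed above.

```python
import shutil
from pathlib import Path


def install_skill(source: Path, skills_dir: Path) -> str:
    """Link (or, failing that, copy) a skill folder into an agent's skills directory.

    Returns "linked" or "copied". Raises FileExistsError if the target exists.
    """
    target = skills_dir / source.name
    if target.exists() or target.is_symlink():
        raise FileExistsError(f"{target} already exists")
    skills_dir.mkdir(parents=True, exist_ok=True)
    try:
        target.symlink_to(source.resolve(), target_is_directory=True)
        return "linked"
    except OSError:
        # Symlinks may be unavailable; fall back to a plain copy.
        shutil.copytree(source, target)
        return "copied"


if __name__ == "__main__":
    # Example: install into the Claude Code skills directory from the repo root.
    source = Path("skills/computer-control")
    if source.is_dir():
        print(install_skill(source, Path.home() / ".claude" / "skills"))
```

Refusing to overwrite an existing target keeps the script from silently replacing a skill you have already customized.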
Use the skill
After installation, invoke it naturally in prompts such as:
- `Use $computer-control to restart the IDEA app and wait until logs stop updating`
- `Use $computer-control to maximize the target window and capture it`
- `Use $computer-control to operate PyCharm with keyboard shortcuts first`
For skill-specific details, see skills/computer-control/README.md.
Project Structure
ControlMCP/
├── README.md                         # This file
├── README.zh-CN.md                   # Chinese README
├── LICENSE                           # GNU GPLv3 license
├── pyproject.toml                    # Package config
├── src/
│   └── control_mcp/
│       ├── __init__.py
│       ├── server.py                 # MCP server + tool registration
│       ├── schemas/
│       │   ├── __init__.py
│       │   └── responses.py          # Structured response types
│       ├── tools/
│       │   ├── __init__.py
│       │   ├── screen.py             # Screen capture tools
│       │   ├── window.py             # Window management tools
│       │   ├── mouse.py              # Mouse control tools
│       │   ├── keyboard.py           # Keyboard control tools
│       │   ├── combined.py           # Combined operations
│       │   └── actions.py            # Additional actions
│       └── utils/
│           ├── __init__.py
│           ├── capture.py            # Capture utilities (JPEG, resize)
│           ├── _win_window.py        # Windows backend
│           ├── _mac_window.py        # macOS backend
│           └── _linux_window.py      # Linux backend
├── skills/
│   └── computer-control/             # Agent Skill: computer operation SOPs
│       ├── SKILL.md                  # Main skill instructions
│       ├── docs/
│       │   ├── coordinate-system.md  # Coordinate system reference
│       │   ├── window-management.md  # Window management reference
│       │   └── idea-run-workflow.md  # JetBrains IDE run/log workflow
│       └── README.md                 # Skill install & usage guide
├── docs/
│   ├── REQUIREMENTS.md
│   ├── ARCHITECTURE.md
│   ├── MODULE_DESIGN.md
│   ├── FUNCTIONAL_DESIGN.md
│   ├── TUTORIAL.md
│   └── zh-CN/                        # Chinese documentation
│       ├── REQUIREMENTS.md
│       ├── ARCHITECTURE.md
│       ├── MODULE_DESIGN.md
│       ├── FUNCTIONAL_DESIGN.md
│       └── TUTORIAL.md
└── tests/
    ├── __init__.py
    ├── test_schemas.py               # 22 tests
    ├── test_screen.py                # 6 tests
    ├── test_window.py                # 11 tests
    ├── test_mouse.py                 # 13 tests
    ├── test_keyboard.py              # 16 tests
    ├── test_combined.py              # 12 tests
    └── test_actions.py               # 13 tests
Platform Support
| Platform | Screen Capture | Window Management | Mouse/Keyboard |
|---|---|---|---|
| Windows | ✅ mss | ✅ pygetwindow | ✅ pyautogui |
| macOS | ✅ mss | ✅ Quartz | ✅ pyautogui |
| Linux | ✅ mss | ✅ xlib | ✅ pyautogui |
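Window management is the only column that varies per platform, which suggests a dispatch on the running OS. Here is a minimal sketch of such a dispatch, using the backend module names from the project structure. The `window_backend_name` function and the mapping itself are illustrative and not taken from the ControlMCP source.

```python
import sys

# Module names come from the project tree; the mapping is an assumption.
BACKENDS = {
    "win32": "_win_window",
    "darwin": "_mac_window",
    "linux": "_linux_window",
}


def window_backend_name(platform: str = sys.platform) -> str:
    """Return the window backend module name for a sys.platform value."""
    for prefix, module in BACKENDS.items():
        if platform.startswith(prefix):
            return module
    raise RuntimeError(f"unsupported platform: {platform}")


print(window_backend_name("darwin"))  # _mac_window
```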
License
GNU General Public License v3.0 (GPLv3)