WinPilot Computer Use MCP

Enables AI agents to control Windows GUI applications like a human using screen capture, OCR, mouse and keyboard input, and window management, with safety levels and memory.

WinPilot is a Windows computer-use MCP server for Codex-style agents. It operates GUI applications the way a human does, using:

  • screen capture
  • computer vision
  • OCR
  • mouse movement
  • keyboard input
  • window management

It intentionally avoids application APIs, browser automation APIs, plugins, extensions, and application integrations. Chrome, Photoshop, Elementor, file dialogs, installers, and desktop apps are all treated as pixels plus OS input.

Status

This repository contains the first production-oriented implementation skeleton:

  • MCP tools for observation, element lookup, waiting, input, windows, screenshots, workflows, permissions, and task execution.
  • A vision pipeline with OCR, UI primitive detection, scrollable/dialog heuristics, screenshot diffing, annotated screenshots, and a semantic desktop model.
  • A guarded executor with per-action safety options, before/after screenshots, human-like mouse motion, keyboard entry, and structured logging.
  • Memory and workflow recording so successful layouts and demonstrations can be reused.

Optional advanced detectors such as PaddleOCR, YOLO, OmniParser, Florence2, and Grounding DINO are wired through extension points. The baseline works with local screenshots, OpenCV, Tesseract, and Windows input primitives.

Install

python -m venv .venv
. .venv\Scripts\Activate.ps1
pip install -e ".[dev]"

Install Tesseract OCR separately and make sure tesseract.exe is on PATH.

Optional vision stack:

pip install -e ".[vision]"

Run MCP Server

win-pilot-mcp

Or:

python -m win_pilot_mcp.mcp.server

For a background/local HTTP MCP endpoint:

$env:WIN_PILOT_MCP_TRANSPORT="streamable-http"
$env:WIN_PILOT_MCP_HOST="127.0.0.1"
$env:WIN_PILOT_MCP_PORT="8765"
python -m win_pilot_mcp.mcp.server

Endpoint:

http://127.0.0.1:8765/mcp
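
A client script can derive the same endpoint from the environment variables shown above. This is a minimal sketch: the helper name endpoint_url is hypothetical, and the fallback values simply mirror the example settings, which may not match the server's real defaults.

```python
import os


def endpoint_url(env=None):
    """Build the HTTP MCP endpoint URL from the WIN_PILOT_MCP_* variables."""
    env = os.environ if env is None else env
    # Assumption: fall back to the example values from this README.
    host = env.get("WIN_PILOT_MCP_HOST", "127.0.0.1")
    port = int(env.get("WIN_PILOT_MCP_PORT", "8765"))
    # The /mcp path is the one documented for the streamable-http transport.
    return f"http://{host}:{port}/mcp"


print(endpoint_url({"WIN_PILOT_MCP_HOST": "127.0.0.1", "WIN_PILOT_MCP_PORT": "8765"}))
```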

Core Loop

Every task is executed as:

  1. Observe the screen.
  2. Think and choose one next action.
  3. Act through mouse, keyboard, or window controls.
  4. Re-observe.
  5. Verify the result.
  6. Retry or recover if needed.

The planner never executes long blind action sequences.
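
The loop above can be sketched in a few lines of Python. This is an illustration of the control flow only, not the project's planner: run_task, StepResult, and the four callables are hypothetical names, and the retry policy (repeat the same action up to max_retries times, then stop) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    action: str
    verified: bool
    attempts: int


def run_task(observe, choose_action, act, verify, max_steps=20, max_retries=2):
    """Observe, choose ONE action, act, re-observe, verify, retry if needed."""
    history = []
    for _ in range(max_steps):
        before = observe()                       # 1. observe
        action = choose_action(before, history)  # 2. choose a single next action
        if action is None:                       # planner reports the task is done
            break
        for attempt in range(1, max_retries + 2):
            act(action)                          # 3. act
            after = observe()                    # 4. re-observe
            ok = verify(action, before, after)   # 5. verify
            if ok:
                break
            before = after                       # 6. retry from the latest screen
        history.append(StepResult(action, ok, attempt))
        if not ok:
            break  # stop here so a recovery routine can take over
    return history


# Toy "screen": a click counter stands in for real observation and input.
screen = {"clicks": 0}
history = run_task(
    observe=lambda: screen["clicks"],
    choose_action=lambda seen, past: "click" if seen < 3 else None,
    act=lambda action: screen.update(clicks=screen["clicks"] + 1),
    verify=lambda action, before, after: after == before + 1,
)
print(len(history), all(step.verified for step in history))  # 3 True
```

Because only one action is chosen per observation, a failed verification is noticed before any further input is sent.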

Safety Levels

Actions are classified into:

  • read_only
  • standard
  • full_control
  • dangerous
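
One way to read this classification is as an ordered scale. The sketch below assumes the levels run from least to most privileged and that an action is allowed only when the current permission level is at least the action's class. The SafetyLevel enum, the ACTION_LEVELS table, and the specific class assigned to each action are hypothetical and are not taken from the WinPilot source.

```python
from enum import IntEnum


class SafetyLevel(IntEnum):
    READ_ONLY = 0
    STANDARD = 1
    FULL_CONTROL = 2
    DANGEROUS = 3


# Hypothetical classification, for illustration only.
ACTION_LEVELS = {
    "capture_screen": SafetyLevel.READ_ONLY,
    "type_text": SafetyLevel.STANDARD,
    "resize_window": SafetyLevel.FULL_CONTROL,
    "delete_file": SafetyLevel.DANGEROUS,  # not a real WinPilot tool name
}


def is_allowed(action, permission=SafetyLevel.STANDARD):
    """Return True if `action` may run at the given permission level."""
    # Unknown actions are treated as the strictest class (an assumption).
    required = ACTION_LEVELS.get(action, SafetyLevel.DANGEROUS)
    return required <= permission


print(is_allowed("type_text"))    # True: standard action at the standard level
print(is_allowed("delete_file"))  # False: dangerous action at the standard level
```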

Each action accepts the following safety options:

{
  "dryRun": false,
  "requireConfirmation": true,
  "takeScreenshotBefore": null,
  "takeScreenshotAfter": null,
  "verificationMode": "auto"
}

The server defaults to standard, which allows normal navigation and input but blocks dangerous actions unless the permission level is raised.

Screenshot verification defaults to auto. In this mode, low-value actions such as mouse move, scroll, focus, key press, hotkey, and wait skip the before/after screenshots, while text entry, clicks, drags, window changes, actions with uncertain targets, and full-control or dangerous actions still capture screenshots for accuracy. To override, set verificationMode to always, or pass explicit takeScreenshotBefore / takeScreenshotAfter booleans.
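
The screenshot rules can be expressed as a small decision function. This is a sketch under stated assumptions: screenshot_plan and LOW_VALUE_ACTIONS are hypothetical names, the action kinds are illustrative labels rather than exact tool names, and explicit booleans are assumed to take precedence over verificationMode.

```python
# Action kinds the auto mode treats as low value (illustrative labels).
LOW_VALUE_ACTIONS = {"move_mouse", "scroll", "focus", "press_key", "hotkey", "wait"}


def screenshot_plan(action, options, uncertain_target=False, safety_level="standard"):
    """Return (take_before, take_after) for one action."""
    mode = options.get("verificationMode", "auto")
    if mode == "always":
        default = True
    else:
        # auto: skip only for low-value actions with a certain target
        # that are not classified as full_control or dangerous.
        skip = (
            action in LOW_VALUE_ACTIONS
            and not uncertain_target
            and safety_level in ("read_only", "standard")
        )
        default = not skip

    def pick(key):
        value = options.get(key)  # null/None means "let the mode decide"
        return default if value is None else bool(value)

    return pick("takeScreenshotBefore"), pick("takeScreenshotAfter")


print(screenshot_plan("scroll", {}))                               # (False, False)
print(screenshot_plan("click", {}))                                # (True, True)
print(screenshot_plan("scroll", {"takeScreenshotAfter": True}))    # (False, True)
print(screenshot_plan("scroll", {"verificationMode": "always"}))   # (True, True)
```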

MCP Tools

Representative tools:

  • analyze_screen
  • configure_optimization
  • clear_observation_cache
  • get_performance_stats
  • analyze_application
  • get_desktop_model
  • get_canvas_state
  • get_photoshop_state
  • get_elementor_state
  • get_browser_state
  • get_word_state
  • get_excel_state
  • get_powerpoint_state
  • get_vscode_state
  • get_illustrator_state
  • get_player_state
  • get_settings_state
  • list_supported_apps
  • get_shortcuts
  • run_app_shortcut
  • get_vision_providers
  • detect_objects
  • find_element
  • wait_for_element
  • wait_until_disappears
  • wait_until_stable
  • detect_state_changes
  • compare_screenshots
  • capture_screen
  • capture_region
  • create_annotated_screenshot
  • move_mouse, click, double_click, right_click, drag, drag_and_drop, draw_path
  • scroll
  • type_text, press_key, hotkey, hold_key, paste_text, select_all
  • list_windows, focus_window, maximize_window, resize_window
  • get_permission_level, set_permission_level
  • get_memory, remember_preference, remember_element, get_remembered_element
  • start_recording, stop_recording, list_workflows, learn_from_user, replay_workflow
  • read_logs
  • recover_from_unexpected_state
  • decide_next_action
  • plan_task
  • execute_task
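
MCP clients invoke these tools with a JSON-RPC tools/call request. The sketch below builds such a request by hand to show its shape. The tool_call helper is hypothetical, and so are the argument name text and the choice to pass safety options such as dryRun alongside the tool arguments; check the schema each tool advertises before relying on them.

```python
import json
from itertools import count

_ids = count(1)  # simple request id generator


def tool_call(name, arguments=None, **safety):
    """Build an MCP tools/call request for one tool invocation."""
    args = dict(arguments or {})
    args.update(safety)  # assumption: safety options travel with the arguments
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": args},
    }


# Hypothetical arguments: look up a "Save" button without acting on it.
request = tool_call("find_element", {"text": "Save"}, dryRun=True)
print(json.dumps(request, indent=2))
```

In practice an MCP client library sends this message for you; the hand-built form is only useful for debugging or reading server logs.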

Feature Coverage

  • Screen understanding: OCR text, buttons, icons, toolbars, menus, tabs, dropdowns, checkboxes, radio buttons, inputs, dialogs, images, canvas areas, loading indicators, context menus, notifications, file pickers, scrollables, selected elements, and a semantic desktop model.
  • Element lookup: text, type, description, image template, color, position, and remembered locations.
  • Vision stack: Tesseract/PaddleOCR OCR, OpenCV primitives and similarity, YOLO adapter, and explicit provider hooks for OmniParser, Florence2, and Grounding DINO.
  • Input: human-like mouse movement, click variants, scroll, drag/drop, drawing paths, text typing, key presses, hotkeys, key holds, paste, and select-all.
  • Windows: list, active window, focus, move, resize, maximize, minimize, and close.
  • Screenshots: full screen, region, window, comparison, change detection, annotated captures, and stability waits.
  • Agent loop: observe, plan one step, act, verify, recover, and retry. plan_task exposes the planned steps; execute_task executes one action at a time with re-observation.
  • Recovery: detects loading, dialogs, crashes, visible errors, and authentication blocks, then recommends the next recovery action.
  • Photoshop and Elementor: screenshot/OCR-only semantic state helpers for panels, canvas, widgets, navigator, publish controls, layers/properties/export dialogs, active tool, and inferred document size.
  • Professional app profiles: Word, Excel, PowerPoint, VSCode, Illustrator, media players, Windows Settings, and browsers expose semantic state helpers plus shortcut maps for faster actions without relying on app APIs.
  • Shortcut-first control: common commands such as word bold, excel format_cells, powerpoint new_slide, vscode command_palette, illustrator pen_tool, player play_pause, and settings open_settings map to keyboard shortcuts before falling back to mouse/vision.
  • Memory and workflows: remembered elements, preferences, action logs, macro recording, human demonstration capture, workflow listing, and replay.
  • Safety: read_only, standard, full_control, and dangerous permission levels, plus dryRun, requireConfirmation, and before/after screenshots on mutating actions.
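
The shortcut-first idea from the list above can be sketched as a lookup with a vision fallback. The SHORTCUTS table, the resolve_command helper, and the shape of the returned dictionaries are hypothetical. The key combinations are the standard ones for these applications, not a copy of WinPilot's own shortcut maps.

```python
# (app, command) -> key combination; a small illustrative subset.
SHORTCUTS = {
    ("word", "bold"): ("ctrl", "b"),
    ("excel", "format_cells"): ("ctrl", "1"),
    ("powerpoint", "new_slide"): ("ctrl", "m"),
    ("vscode", "command_palette"): ("ctrl", "shift", "p"),
    ("illustrator", "pen_tool"): ("p",),
    ("settings", "open_settings"): ("win", "i"),
}


def resolve_command(app, command):
    """Prefer a keyboard shortcut; otherwise fall back to finding the control on screen."""
    keys = SHORTCUTS.get((app.lower(), command.lower()))
    if keys is not None:
        return {"tool": "hotkey", "keys": list(keys)}
    # No known shortcut: search for the control by its visible label instead.
    return {"tool": "find_element", "text": command.replace("_", " ")}


print(resolve_command("word", "bold"))          # {'tool': 'hotkey', 'keys': ['ctrl', 'b']}
print(resolve_command("word", "insert_table"))  # {'tool': 'find_element', 'text': 'insert table'}
```

A shortcut avoids a screenshot, OCR pass, and mouse movement, which is why it is tried first.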

Project Layout

src/win_pilot_mcp/
  mcp/
  agent/
  vision/
  executor/
  planner/
  memory/
  tools/
  workflows/
  permissions/
  logs/
  screenshots/

Runtime artifacts are written to runtime/ by default.
