gui-user
An MCP server for external computer-use that launches, observes, and interacts with any X11 application via AT-SPI2 accessibility tree and xdotool input injection.
Unlike in-process testing frameworks, gui-user works externally — it can drive compiled C++ Qt/QML apps, GTK apps, Electron apps, or anything that renders on X11.
Installation
1. System packages
# Debian/Ubuntu — required
sudo apt install xvfb xdotool at-spi2-core dbus imagemagick libgirepository1.0-dev
# Optional — for VNC observation of the headless display
sudo apt install x11vnc tigervnc-viewer
# Optional — for OCR-based text detection in screenshots
sudo apt install tesseract-ocr
2. Install gui-user
Clone the repo and install in development mode:
git clone <repo-url> gui-user
cd gui-user
pip install -e .
This puts gui-user-mcp on your $PATH as the MCP server entry point.
3. Configure Claude Code
Add gui-user as a user-scope MCP server (available in all projects):
claude mcp add gui-user -s user -- gui-user-mcp
Or for a single project only, run from the project directory:
claude mcp add gui-user -- gui-user-mcp
Alternatively, you can create .mcp.json in the project root (this is shared via source control):
{
"mcpServers": {
"gui-user": {
"command": "gui-user-mcp"
}
}
}
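If the project already has an .mcp.json with other servers in it, the entry can be merged in rather than overwritten. The following is a minimal sketch of that idea; the helper name `write_mcp_config` is hypothetical and not part of gui-user, and only the file layout shown above is assumed.

```python
import json
from pathlib import Path


def write_mcp_config(project_root: Path, command: str = "gui-user-mcp") -> Path:
    """Create or update .mcp.json so that it registers the gui-user server."""
    config_path = project_root / ".mcp.json"
    # Keep any servers that are already configured in the file.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})["gui-user"] = {"command": command}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config_path
```

Calling `write_mcp_config(Path("."))` from the project root should produce the same structure as the hand-written file.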
Verify the server is connected:
claude mcp list
If using VS Code, reload the window (Ctrl+Shift+P → "Developer: Reload Window") after adding the server, then start a new conversation. Type /mcp in the chat panel to confirm gui-user appears.
Tools
| Tool | Description |
|---|---|
| launch_app(binary, args, env, working_dir, width, height, timeout, display_mode, display, vnc) | Launch any binary under an isolated Xvfb display or a visible local X11 display |
| close_app() | Close the app (display session stays alive for reuse) |
| stop_display() | Tear down the display session (Xvfb, D-Bus, VNC) |
| get_app_status() | Check if app is running, get PID/exit code/stderr |
| screenshot(output_path?) | Capture screen as base64 PNG |
| list_ui_elements(role?, name?, visible_only?) | Enumerate AT-SPI accessibility tree |
| find_element(text?, role?, index?) | Find element by label/role |
| get_element_info(text?, role?, at_x?, at_y?) | Detailed element properties or coordinate lookup |
| click(x, y, button?) | Click at screen coordinates |
| click_element(text?, role?, index?, button?) | Find element and click its center |
| double_click(x, y, button?) | Double-click at coordinates |
| double_click_element(text?, role?, index?, button?) | Find element and double-click |
| hover(x, y) | Move mouse to coordinates |
| hover_element(text?, role?, index?) | Move mouse to element center |
| type_text(text) | Type text into focused widget |
| press_key(key, modifiers?) | Key press (e.g., press_key("s", ["Ctrl"])) |
| wait_for_idle(timeout?) | Wait for CPU usage to settle |
| wait_for_element(text?, role?, timeout?) | Poll until element appears |
| batch_actions(actions) | Execute a sequence of actions in one call (avoids per-action round-trips) |
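The "poll until element appears" behaviour of wait_for_element follows a common pattern: call a lookup repeatedly until it succeeds or a deadline passes. The sketch below illustrates that pattern only; `poll_until` and the probe callable are hypothetical names, and the server's real implementation may differ.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    probe: Callable[[], Optional[T]],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> Optional[T]:
    """Call probe() until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None  # the element never appeared
        time.sleep(interval)
```

A monotonic clock is used so that wall-clock adjustments cannot shorten or extend the wait.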
Example Workflow
# Launch any binary in the default isolated Xvfb session
launch_app(binary="/usr/bin/gnome-calculator")
# Launch on the operator's visible X11 desktop instead
launch_app(
binary="/usr/bin/gnome-calculator",
display_mode="local",
)
# Or target a specific local display explicitly
launch_app(
binary="/usr/bin/gnome-calculator",
display_mode="local",
display=":1",
)
# Discover UI elements
list_ui_elements()
# Find and click a button by its visible label
click_element(text="7", role="button")
click_element(text="+", role="button")
click_element(text="3", role="button")
click_element(text="=", role="button")
# Type text
type_text(text="hello world")
# Keyboard shortcuts
press_key(key="s", modifiers=["Ctrl"])
# Screenshot
screenshot(output_path="/tmp/result.png")
# Clean up
close_app()
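The calls above are written in shorthand. On the wire, an MCP client sends each one as a JSON-RPC 2.0 `tools/call` request, one JSON object per line over stdio. The sketch below builds such requests for the calculator steps; `tool_call` is a hypothetical helper, and a real client must also complete the MCP initialize handshake before calling tools.

```python
import json
from typing import Any, Dict, Optional


def tool_call(tool: str, arguments: Optional[Dict[str, Any]] = None, request_id: int = 1) -> str:
    """Serialize one MCP tools/call request as a single JSON line."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments or {}},
    }
    return json.dumps(request)


# The four calculator clicks from the workflow, as wire-level requests.
for request_id, label in enumerate(["7", "+", "3", "="], start=1):
    print(tool_call("click_element", {"text": label, "role": "button"}, request_id))
```

In practice the assistant's MCP client generates these messages, so the shorthand form is all an operator normally needs.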
Architecture
AI Assistant (Claude)
│ MCP Protocol (stdio)
▼
MCP Server (main.py)
│ Orchestrates:
├── DisplayManager (Xvfb/local X11 + D-Bus + AT-SPI)
├── ProcessManager (binary launch/monitor)
├── AccessibilityTree (AT-SPI2 element discovery)
├── ScreenshotCapture (ImageMagick import)
├── InputController (xdotool mouse/keyboard)
└── IdleWaiter (CPU-based idle detection)
│
▼
Target Application (any X11 binary)
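To make the "CPU-based idle detection" component more concrete, here is one way such a check could work on Linux: sample the process's accumulated CPU ticks from /proc and treat the app as idle once an interval passes with almost no new ticks. This is a sketch under that assumption, not the IdleWaiter source; `cpu_ticks` and `wait_until_idle` are hypothetical names.

```python
import time


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is parenthesized and may contain spaces,
    # so split after the last ')' before counting fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])  # fields 14 and 15 of the full line


def wait_until_idle(pid: int, timeout: float = 10.0, interval: float = 0.2, max_ticks: int = 0) -> bool:
    """Return True once the process uses at most max_ticks of CPU in one interval."""
    deadline = time.monotonic() + timeout
    with open(f"/proc/{pid}/stat") as f:
        previous = cpu_ticks(f.read())
    while time.monotonic() < deadline:
        time.sleep(interval)
        with open(f"/proc/{pid}/stat") as f:
            current = cpu_ticks(f.read())
        if current - previous <= max_ticks:
            return True
        previous = current
    return False
```

A CPU-based check does not depend on the target toolkit, which fits the goal of driving any X11 binary, but an app that animates continuously may never look idle and will simply run into the timeout.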
Key Differences from qt-pilot
This project was forked from qt-pilot and redesigned:
| | qt-pilot | gui-user |
|---|---|---|
| Target apps | Python/PySide6 only | Any X11 binary |
| Discovery | objectName (requires code changes) | AT-SPI accessibility tree (no code changes) |
| Interaction | In-process QTest | External xdotool |
| Architecture | Monkeypatch + socket IPC | External observation + input injection |
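"External xdotool" means input arrives as ordinary X11 events rather than through calls inside the app. As a rough illustration, the sketch below shows how a click and a key chord could be translated into xdotool command lines. The function names are hypothetical, and the actual InputController may build its commands differently.

```python
from typing import List, Sequence


def click_argv(x: int, y: int, button: int = 1) -> List[str]:
    """Move the pointer to (x, y), then click the given mouse button."""
    return ["xdotool", "mousemove", str(x), str(y), "click", str(button)]


def key_argv(key: str, modifiers: Sequence[str] = ()) -> List[str]:
    """Press a key with optional modifiers, e.g. key_argv("s", ["Ctrl"])."""
    # xdotool writes a chord as modifier+key, for example ctrl+s.
    chord = "+".join([m.lower() for m in modifiers] + [key])
    return ["xdotool", "key", chord]
```

Each list can be passed to `subprocess.run` with DISPLAY set to the session's display, so the events reach the target app whatever toolkit it was built with.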
Running Tests
python3 -m unittest tests.test_integration tests.test_local_display -v
Observing the Headless Display (VNC)
Pass vnc=True to launch_app to start a view-only VNC server alongside the Xvfb display. This lets the operator watch what the AI is doing without interfering.
launch_app(binary="my_app", vnc=True)
# Response includes: "vnc_display": "localhost:5900"
To connect, run from any terminal:
gui-user-view
This auto-detects the running x11vnc and opens a VNC viewer. If x11vnc isn't running yet, it starts one on the first Xvfb display it finds. You can also pass a specific port: gui-user-view 5902
To connect manually: vncviewer localhost:<port>
Requirements: sudo apt install x11vnc tigervnc-viewer
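For scripted setups, the address reported in the launch_app response can be turned into the manual viewer command. This is a small sketch that assumes the response is available as a dict with the "vnc_display" key shown above; both function names are hypothetical.

```python
from typing import List, Tuple


def parse_vnc_display(response: dict) -> Tuple[str, int]:
    """Split a "host:port" vnc_display value into its two parts."""
    vnc_display = response.get("vnc_display")
    if not vnc_display:
        raise ValueError("no vnc_display in response; was vnc=True passed?")
    host, _, port = vnc_display.rpartition(":")
    return host, int(port)


def viewer_argv(response: dict) -> List[str]:
    """Build the manual connection command: vncviewer <host>:<port>."""
    host, port = parse_vnc_display(response)
    return ["vncviewer", f"{host}:{port}"]
```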
Helper Commands
These are installed on your $PATH by pip install:
| Command | Description |
|---|---|
| gui-user-view | Auto-detect the running Xvfb display and open a VNC viewer. Starts x11vnc if needed. |
| gui-user-stop | Kill any lingering Xvfb, x11vnc, and at-spi2-registryd processes. Useful for cleanup after crashes or interrupted sessions. |
The underlying shell scripts (view-display.sh, stop-display.sh) are also available in the repo root.
Display Lifecycle
The display session (Xvfb + D-Bus + VNC) persists across app restarts. This means:
- launch_app() creates the display on first call and reuses it on subsequent calls
- close_app() terminates only the app; the display and VNC stay alive
- stop_display() tears down everything (Xvfb, D-Bus, VNC)
This lets the operator connect the VNC viewer once and watch across multiple app launch/close cycles.
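The lifecycle rules can be summarized as a small state model. The class below is a toy illustration, not the server's DisplayManager; in particular, treating stop_display() as also ending a running app is an assumption.

```python
class SessionModel:
    """Toy model of the display lifecycle described above."""

    def __init__(self) -> None:
        self.display_up = False
        self.app_running = False
        self.displays_created = 0

    def launch_app(self) -> None:
        if not self.display_up:
            # First call: create the display session.
            self.display_up = True
            self.displays_created += 1
        # Later calls reuse the existing display.
        self.app_running = True

    def close_app(self) -> None:
        # Only the app goes away; the display and VNC stay alive.
        self.app_running = False

    def stop_display(self) -> None:
        # Assumption: tearing down the display also ends any running app.
        self.app_running = False
        self.display_up = False
```

Two launch and close cycles in a row leave `displays_created` at 1, which is why a VNC viewer only has to be connected once.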
Screenshot Gallery
Every screenshot() call auto-saves a timestamped PNG to .gui-user/screenshots/ in the current working directory. Browse this folder to review the full visual history of a session.
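A short script can list the gallery in capture order. The sketch below sorts by file modification time instead of parsing the timestamped names, since the exact filename format is not specified here; `list_screenshots` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def list_screenshots(cwd: Path = Path(".")) -> List[Path]:
    """Return the saved PNGs under .gui-user/screenshots/, oldest first."""
    gallery = cwd / ".gui-user" / "screenshots"
    if not gallery.is_dir():
        return []  # no screenshots taken from this directory yet
    return sorted(gallery.glob("*.png"), key=lambda p: p.stat().st_mtime)
```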
Local Display Mode
display_mode="local" reuses a real X11 display so the operator can watch the app while the MCP drives it.
- This mode is opt-in. The default remains an isolated Xvfb session.
- Local mode is intended for X11 or XWayland displays only.
- Mouse, keyboard, and focus are shared with the operator, so runs are less deterministic.
- width and height are ignored in local mode because the existing desktop geometry is reused.
- For unattended or CI-style runs, prefer the default Xvfb mode.
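These guidelines can be folded into a small helper that decides which arguments to pass to launch_app. The sketch below is one possible policy, not part of gui-user: `launch_kwargs` is a hypothetical name, and treating a set `CI` environment variable as "unattended" is an assumption based on common CI conventions.

```python
import os
from typing import Dict, Mapping, Optional


def launch_kwargs(binary: str, watch: bool, environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick launch_app arguments: local mode only when it is safe and requested."""
    env = os.environ if environ is None else environ
    kwargs = {"binary": binary}
    # Opt in to local mode only if the operator asked to watch, an X11 or
    # XWayland display is available, and this does not look like a CI run.
    if watch and env.get("DISPLAY") and not env.get("CI"):
        kwargs["display_mode"] = "local"
        kwargs["display"] = env["DISPLAY"]
    # Otherwise leave display_mode unset so the default Xvfb session is used.
    return kwargs
```

width and height are deliberately left out because they are ignored in local mode anyway.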
License
MIT License - see LICENSE file.