linux-computer-use
Enables AI agents to control Linux/X11 desktops by providing tools for taking screenshots, clicking, typing, and managing windows via AT-SPI and xdotool.
README
<p align="center"> <img src="./assets/logo.png" width="40%" alt="linux-computer-use"> </p>
<h1 align="center">linux-computer-use</h1>
<p align="center"> <em>Linux/X11 computer-use tools for AI agents — Pi, Claude Code, OpenCode, and any MCP-aware client. AT-SPI + xdotool, ~1k LOC.</em> </p>
<p align="center"> <a href="LICENSE"><img alt="license" src="https://img.shields.io/github/license/tak-uukti/linux-computer-use?style=flat-square"></a> <img alt="platform" src="https://img.shields.io/badge/platform-linux%20%7C%20x11-blue?style=flat-square"> <img alt="node" src="https://img.shields.io/badge/node-%E2%89%A520.6-339933?style=flat-square&logo=node.js&logoColor=white"> <img alt="python" src="https://img.shields.io/badge/python-%E2%89%A53.10-3776ab?style=flat-square&logo=python&logoColor=white"> <a href="https://github.com/tak-uukti/linux-computer-use/releases"><img alt="release" src="https://img.shields.io/github/v/tag/tak-uukti/linux-computer-use?style=flat-square&label=release"></a> </p>
A Linux port of @injaneity/pi-computer-use. One bridge, three frontends:
- Pi extension for
mariozechner/pi-coding-agent - MCP server for Claude Code (and any other MCP host)
- MCP server for OpenCode
The macOS original uses Apple's Accessibility API + AppleScript + ScreenCaptureKit (~6,800 lines of Swift + TS). This port replaces the entire native layer with AT-SPI 2 + xdotool + scrot, ships a single ~470-line Python bridge, and trims the tool surface from 15 → 8 to keep prompts cheap.
| upstream macOS | this port | |
|---|---|---|
| Total LOC | ~6,866 | ~1,200 (-83%) |
| Tools registered | ~15 | 8 |
| Native helper | 2,065 lines Swift | 471 lines Python |
| Runtime deps | Swift toolchain, codesign | python3-gi, xdotool, wmctrl, scrot |
| Frontends | macOS only | Pi · Claude Code · OpenCode · any MCP client |
System dependencies (all installs)
# Debian/Ubuntu
sudo apt-get install -y python3 python3-gi gir1.2-atspi-2.0 xdotool wmctrl scrot
# Enable AT-SPI on the desktop session (GNOME)
gsettings set org.gnome.desktop.interface toolkit-accessibility true
X11 only — Wayland sessions cannot capture other-app windows or synthesize input via xdotool. Run a GNOME-on-Xorg, KDE-on-X11, or XFCE session.
Install
Option 1 — Pi (mariozechner/pi-coding-agent)
pi install git:github.com/tak-uukti/linux-computer-use@v0.2.0
The postinstall script writes a small bash wrapper to ~/.pi/agent/helpers/linux-computer-use/bridge that execs python3 bridge/bridge.py. No build step, no codesign, no native compile.
In a Pi session, call screenshot first — it picks the focused window, returns AT-SPI refs (@e1, @e2, …) plus a PNG, then you can click({ref:"@e3"}), set_text({ref:"@e2", text:"…"}), etc.
Option 2 — Claude Code (MCP)
Installable as an MCP server straight from GitHub via uvx (no clone, no manual venv):
claude mcp add linux-computer-use -- uvx --from git+https://github.com/tak-uukti/linux-computer-use linux-computer-use-mcp
Or, equivalently, drop this into your Claude Code MCP config file (~/.claude.json under mcpServers):
{
"mcpServers": {
"linux-computer-use": {
"command": "uvx",
"args": [
"--from",
"git+https://github.com/tak-uukti/linux-computer-use",
"linux-computer-use-mcp"
]
}
}
}
Restart Claude Code; the 8 tools (list_windows, screenshot, click, type_text, set_text, keypress, scroll, computer_actions) appear under the linux-computer-use namespace.
Option 3 — OpenCode (MCP)
Add to ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"mcp": {
"linux-computer-use": {
"type": "local",
"command": [
"uvx",
"--from",
"git+https://github.com/tak-uukti/linux-computer-use",
"linux-computer-use-mcp"
],
"enabled": true
}
}
}
Restart OpenCode and the tools become available to the agent.
Tools
8 total. Schemas are deliberately terse — see extensions/computer-use.ts (Pi) or mcp_server/server.py (MCP).
| name | purpose |
|---|---|
list_windows |
enumerate visible X11 windows; returns @wN, title, pid, geometry, focus state |
screenshot |
focus a window, capture PNG, walk AT-SPI tree → @eN targets with role / name / bounds / capabilities |
click |
click @eN, @wN, or x,y; supports button and clickCount |
type_text |
xdotool-type literal text at the cursor |
set_text |
replace value of an @eN text/entry via AT-SPI EditableText (falls back to focus + Ctrl+A + type) |
keypress |
press keys/chords — ["Return"], ["Ctrl","A"], ["ctrl+l","Return"], etc. |
scroll |
scroll at ref/coords by pixel delta |
computer_actions |
batch up to 20 actions in a single call |
Architecture
┌──────────────────────────────────────────────┐
│ Pi Claude Code OpenCode │
└──────┬─────────────┬──────────────────┬──────┘
│ │ │
│ extension │ MCP stdio │ MCP stdio
▼ ▼ ▼
┌──────────────┐ ┌────────────────────────────┐
│ extensions/ │ │ mcp_server/server.py │
│ computer- │ │ FastMCP wrapper (8 tools) │
│ use.ts │ └─────────────┬──────────────┘
└──────┬───────┘ │
│ │
▼ ▼
┌──────────────────────────────────────────────┐
│ bridge/bridge.py newline-JSON over stdio │
│ AT-SPI walk · xdotool · wmctrl · scrot │
└──────────────────────────────────────────────┘
The AT-SPI walker is depth-capped (12) and element-capped (200) to keep prompts lean. Element bounds use SCREEN coords with a fallback to WINDOW coords + window offset (necessary for GTK4 / Xwayland which report SCREEN as 0,0).
Verified end-to-end
These captures are from the bridge running against a Xvfb :99 + openbox session, driving real Linux apps. Screenshots taken via scrot after the bridge issued the actions.
gnome-calculator — keypress flow
keypress: 7, +, 8, Return → display shows 15. 26 AT-SPI elements detected, every push button reports canPress: true and accurate bounds.
<p align="center"> <img src="./assets/screenshots/01-calc-keypress.png" width="60%"> </p>
gnome-calculator — AT-SPI @eN ref clicks
computer_actions: [click @e3, click @e7] (which the bridge resolves to push buttons "4" and "5") → display shows 45.
<p align="center"> <img src="./assets/screenshots/02-calc-ax-ref-click.png" width="60%"> </p>
gedit — full type_text round-trip
type_text: "Hello sir, … Linux X11 + AT-SPI + xdotool working end-to-end." → 169 characters typed. 190 AT-SPI elements found in gedit's window.
<p align="center"> <img src="./assets/screenshots/03-gedit-typed.png" width="80%"> </p>
gedit — clear and retype
keypress ctrl+a → keypress Delete → type_text "Taksheel". Status bar reads Ln 1, Col 9.
<p align="center"> <img src="./assets/screenshots/04-gedit-taksheel.png" width="80%"> </p>
App compatibility matrix
| App | screenshot | AT-SPI refs | input |
|---|---|---|---|
| gnome-calculator | ✅ | ✅ 26 elements, full action metadata | ✅ |
| gedit | ✅ | ✅ 190 elements | ✅ |
| GTK / Qt apps with AT-SPI | ✅ | ✅ | ✅ |
| Google Chrome / Chromium | ✅ | ⚠️ AT-SPI tree empty unless launched with --force-renderer-accessibility |
✅ (coords / keypress) |
| Firefox | ✅ | ✅ on a real session (gates on gsettings toolkit-accessibility) |
✅ |
| Electron apps | ✅ | ⚠️ same as Chrome — needs --force-renderer-accessibility |
✅ |
| LibreOffice (real Xorg session) | ✅ | ✅ via SAL_USE_COMMON_ONE_ACCESSIBILITY=1 |
✅ |
| Xvfb / nested X | ✅ | partial (some apps misbehave under Xvfb without a real session bus) | ✅ |
Limitations
- X11 only. Wayland sessions cannot capture other-app windows or synthesize input via xdotool.
- Apps must export AT-SPI for
@eNrefs to populate. Most GTK / Qt apps do; Electron / Chromium need--force-renderer-accessibility. - Mouse cursor physically moves — no stealth pointer on X11.
- Dropped vs upstream:
move_mouse,drag,wait,double_click,arrange_window,navigate_browser,list_apps. Usekeypress,type_text, andcomputer_actionsto compose what you need.
Development
git clone https://github.com/tak-uukti/linux-computer-use
cd linux-computer-use
# Pi side (TypeScript)
npm install
npm run typecheck
# Bridge sanity
python3 -c "import ast; ast.parse(open('bridge/bridge.py').read())"
echo '{"id":"1","cmd":"list_windows"}' | python3 bridge/bridge.py
# MCP side
python3 -m venv .venv && .venv/bin/pip install -e .
.venv/bin/linux-computer-use-mcp # speaks MCP over stdio
The Pi extension API surface is stubbed locally in src/types.ts so typecheck runs without @mariozechner/pi-coding-agent installed.
Layout
.
├── assets/ logo + screenshots
├── bridge/
│ ├── bridge.py 471-line Python helper (AT-SPI + xdotool + scrot)
│ └── requirements.txt
├── extensions/
│ └── computer-use.ts Pi tool registration + JSON schemas
├── mcp_server/
│ ├── __init__.py
│ └── server.py FastMCP wrapper around the bridge (8 tools)
├── scripts/
│ └── setup-helper.mjs Pi postinstall — writes ~/.pi/.../bridge wrapper
├── skills/computer-use/SKILL.md pi skill — Quick Start + Pitfalls
├── src/
│ ├── bridge.ts Pi-side subprocess manager + JSON-line protocol
│ └── types.ts local stubs for the pi-coding-agent extension API
├── package.json npm metadata (Pi extension)
├── pyproject.toml MCP server packaging (uvx-installable)
├── tsconfig.json
├── CHANGELOG.md
├── LICENSE
└── README.md
Credits
@injaneity/pi-computer-use— macOS original, design and protocol shape.@mariozechner/pi-coding-agent— the Pi agent.- Model Context Protocol — Claude Code / OpenCode interop.
- AT-SPI 2, xdotool, wmctrl, scrot — the Linux building blocks doing all the real work.
License
MIT © 2026 Tak1tak · built by Tak1tak
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.