hermes-computer-use

Pixel-level browser automation MCP server that drives a real Chrome browser using screenshots as vision input and OS-level mouse/keyboard as output, evading anti-bot detection.

English · 日本語 · 中文 · 한국어

License: MIT · Python 3.11+ · Platform: WSL2 Ubuntu

Scope: Windows 11 + WSL2 Ubuntu 22.04 / 24.04 only. See docs/WSL_SETUP.md.

Pixel-level browser automation MCP server. Gives any MCP-speaking agent (hermes-agent, Claude Code, Codex, …) 30 tools to drive a real Chrome browser in an Xvfb display: screenshots as vision input, OS-level mouse/keyboard as output. By default there is no CDP, no navigator.webdriver flag, and no DOM shortcut.

<p align="center"><img src="docs/assets/demo-snp500.gif" alt="The agent opens Chrome, types 'snp500' into Google search, presses Enter. Google serves a normal results page including the live S&P 500 quote card. Entirely driven by pixel screenshots and xdotool." width="760"></p>

What the GIF shows: an agent opens Chrome, focuses the Google search bar, types snp500, presses Enter, and Google returns a full SERP with the live S&P 500 index card. The same flow routinely trips "unusual traffic" or a captcha for Playwright-driven automation. This stack was not flagged here because the browser is stock Chrome driven by stock X11 input, so it presents no automation-specific fingerprint to detect.
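
As a rough illustration of that flow, the sketch below builds the xdotool command lines that the agent's tool calls might translate into. It is a sketch under assumptions, not the server's actual code: the function name `search_flow` and the example coordinates are hypothetical, and in practice the coordinates would come from a vision model reading the latest screenshot. Only standard xdotool subcommands (`mousemove`, `click`, `type --delay`, `key`) are used.

```python
"""Sketch: the GIF's search flow expressed as xdotool command lines."""
import shlex


def search_flow(x: int, y: int, query: str, key_delay_ms: int = 25) -> list[list[str]]:
    """Return xdotool argv lists for: click the search bar, type, press Enter.

    (x, y) is where the search bar was located in a screenshot. The values
    passed in below are hypothetical placeholders, not measured coordinates.
    """
    return [
        ["xdotool", "mousemove", str(x), str(y)],  # move the X pointer
        ["xdotool", "click", "1"],                 # button 1 is the left button
        ["xdotool", "type", "--delay", str(key_delay_ms), query],
        ["xdotool", "key", "Return"],              # submit the search
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so no X display is needed.
    for argv in search_flow(720, 410, "snp500"):
        print(shlex.join(argv))
```

The commands are only printed here. Running them requires an X display, such as the Xvfb instance described under Architecture.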

Why this exists

| | Playwright / CDP | hermes-computer-use |
| --- | --- | --- |
| navigator.webdriver | true (detectable) | undefined |
| CDP endpoint open | yes | no |
| DOM access | direct (fast, brittle to markup changes) | screenshot only (slower, resilient to UI rewrites) |
| Anti-bot footprint | large, constantly patched | near-zero: stock Chrome + stock X input |
| Best for | flows on sites you own | agents operating unfamiliar sites like a human |

If your automation has to walk a signup funnel on a site guarded by Cloudflare, Kasada, reCAPTCHA, or DataDome, this stack usually passes where Playwright gets stopped.

Evidence: docs/assets/demo-sannysoft.png (the bot.sannysoft.com fingerprint panel with WebDriver, Chrome runtime, Permissions, Plugins, Languages, and PHANTOM all passed).

Architecture

agent ── stdio MCP ──▶ hermes_computer_use.server ── subprocess ──▶ xdotool / scrot
                                                                        │
                                                                        ▼
                                                                    Xvfb :99
                                                                        │
                                                     ┌──────────────────┼──────────────────┐
                                                     ▼                                     ▼
                                               x11vnc :5900                    websockify + noVNC :6080
                                          (native VNC clients)                 (browser viewer)

Longer version: docs/ARCHITECTURE.md.

Install

Prerequisites: Windows 11, WSL2 with Ubuntu 22.04/24.04, systemd enabled. Full walkthrough: docs/WSL_SETUP.md.

Everything below runs inside the WSL shell.

From PyPI

pip install "hermes-computer-use[novnc]"

The PyPI package installs only the Python code. You still need the system packages (Xvfb, Chrome, xdotool…) and the systemd units: see steps 1 and 4 under "From source".

From source

git clone https://github.com/Noah3521/hermes-computer-use.git ~/hermes-computer-use
cd ~/hermes-computer-use

bash scripts/setup.sh                            # 1. apt + Chrome + uinput (sudo)
python3 -m venv .venv && . .venv/bin/activate
pip install -e ".[novnc]"                        # 2. Python package
bash scripts/install-novnc.sh                    # 3. (optional) web viewer

mkdir -p ~/.config/systemd/user                  # 4. persistent services
cp systemd/*.example ~/.config/systemd/user/
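# rename each copied file to drop the .example suffix (systemd only loads
# files ending in a unit type such as .service), edit the paths inside to
# match your clone, then: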
sudo loginctl enable-linger "$USER"
systemctl --user daemon-reload
systemctl --user enable --now computer-use.service novnc.service

Smoke test: python examples/smoke_test.py.

Wire to an MCP client

Copy the relevant snippet from config/hermes.yaml.example into your agent's MCP server config. Works with hermes-agent, Claude Code, Codex, mcp-inspector, or any stdio MCP client.
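
For clients that take a JSON config, an entry might be generated along these lines. This is an assumption-heavy sketch and config/hermes.yaml.example remains the authoritative snippet. The `mcpServers` layout is the shape several stdio MCP clients use. The server name `computer_use` is borrowed from the demo prompt. Launching with `python -m hermes_computer_use.server` is a guess based on the module name in the architecture diagram and is not confirmed by this README.

```python
"""Sketch: emit a stdio MCP server entry as JSON (launch command is assumed)."""
import json


def mcp_server_entry(python_bin: str = "python", display: int = 99) -> dict:
    """Build a config fragment for a stdio MCP client.

    ASSUMPTION: the server can be started with `-m hermes_computer_use.server`.
    Check config/hermes.yaml.example for the real command before using this.
    """
    return {
        "mcpServers": {
            "computer_use": {
                "command": python_bin,
                "args": ["-m", "hermes_computer_use.server"],
                # CU_DISPLAY is one of the documented env vars.
                "env": {"CU_DISPLAY": str(display)},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(mcp_server_entry(), indent=2))
```

If the package is installed in a virtualenv, `python_bin` would need to be the interpreter inside that environment so the client can import the package.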

Hand the install to an LLM

If your agent has shell + filesystem tools, you can skip the manual install entirely: paste the prompt in docs/LLM_SETUP_PROMPT.md and it will clone, install, wire up systemd, run the smoke test, and report back. Available in English, 日本語, 中文, 한국어.

Tools (30)

| Category | Tools |
| --- | --- |
| Status | screen_info, cursor_position |
| Capture | screenshot |
| Pointer | move, left_click, right_click, double_click, middle_click, drag, scroll |
| Keyboard | type_text, press_key, hold_key, clear_field, select_all, copy, paste, cut, undo, redo, clipboard_set, clipboard_get |
| Timing | wait |
| Browser | open_url, new_tab, close_tab, back, forward, reload |
| Escape hatch | run_shell |
| Optional DOM fast-path (CU_ENABLE_CDP=1) | dom_click, dom_type, dom_query, dom_exists, dom_wait, dom_eval, network_capture, console_messages |

press_key accepts case-insensitive names and aliases — Backspace, backspace, BackSpace all work; cmd+a, command-a, ctrl+a all resolve; meta/win/windows/cmd map to Super.
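
One way such alias handling could work is sketched below. It is not the package's implementation: `normalize_key` and both lookup tables are hypothetical, and only the Backspace spelling and the cmd/command/meta/win/windows to Super mapping come from the text above. The output uses xdotool's `+`-joined key syntax, where `super`, `ctrl`, `alt`, and `shift` are accepted modifier aliases.

```python
"""Sketch: normalising press_key specs into xdotool key syntax."""
import re

# Modifier aliases. The cmd/command/meta/win/windows -> super mapping comes
# from the README; "control" is an extra alias assumed for illustration.
MODIFIERS = {
    "cmd": "super", "command": "super", "meta": "super",
    "win": "super", "windows": "super", "super": "super",
    "ctrl": "ctrl", "control": "ctrl", "alt": "alt", "shift": "shift",
}

# Lowercased name -> X keysym. Only Backspace is named in the README; the
# remaining rows are illustrative assumptions.
NAMED_KEYS = {
    "backspace": "BackSpace", "enter": "Return", "return": "Return",
    "esc": "Escape", "escape": "Escape", "tab": "Tab",
}


def normalize_key(spec: str) -> str:
    """Turn 'Cmd+A', 'command-a', or 'backspace' into xdotool key syntax.

    '+' and '-' are both treated as separators, so the literal plus and
    minus keys would have to be given by name in this sketch.
    """
    parts = [p.strip() for p in re.split(r"[+-]", spec) if p.strip()]
    if not parts:
        raise ValueError(f"empty key spec: {spec!r}")
    *mods, key = parts
    resolved = []
    for mod in mods:
        name = MODIFIERS.get(mod.lower())
        if name is None:
            raise ValueError(f"unknown modifier: {mod!r}")
        resolved.append(name)
    low = key.lower()
    # Known names win; single characters are lowercased; anything else
    # (for example F5) is passed through unchanged.
    resolved.append(
        NAMED_KEYS.get(low) or MODIFIERS.get(low) or (low if len(key) == 1 else key)
    )
    return "+".join(resolved)


if __name__ == "__main__":
    for spec in ["Backspace", "backspace", "BackSpace", "cmd+a", "command-a", "ctrl+a"]:
        print(spec, "->", normalize_key(spec))
```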

Opt-in DOM fast-path

For DOM-heavy pages where vision grounding is slow or fragile (SPA dashboards, deep forms), you can opt into CSS-selector-based clicks / typing / queries. Trade-off: Chrome exposes a DevTools port and navigator.webdriver flips to true for the session, which defeats the anti-bot posture on sites that fingerprint Chrome. Off by default.

CU_ENABLE_CDP=1 bash scripts/display.sh restart
pip install "hermes-computer-use[dom]"        # adds websocket-client
# Run the MCP with CU_ENABLE_CDP=1 in its env too (hermes config etc.)

See docs/ARCHITECTURE.md#dom-fast-path for when to use which.

Demo prompts

Try any of the prompts in examples/demo_prompts.md. The simplest and most illustrative:

"Use computer_use to open Google, search for snp500, and tell me the current S&P 500 index price from the page."

Open http://localhost:6080/vnc.html in a browser while the agent runs — watching the cursor arc through the search bar is surprisingly compelling.

Configuration (env vars)

| Var | Default | Meaning |
| --- | --- | --- |
| CU_DISPLAY | 99 | X display number |
| CU_WIDTH / CU_HEIGHT | 1440 / 900 | Virtual screen size |
| CU_VNC_PORT | 5900 | x11vnc listen port |
| CU_STATE_DIR | /tmp/hermes-computer-use | Logs, PID files |
| CU_PROFILE_DIR | $CU_STATE_DIR/chrome-profile | Persistent Chrome profile |
| CU_START_URL | about:blank | First URL Chrome opens |
| CU_INPUT | xdotool | Set to ydotool for /dev/uinput input |
| CU_KEY_DELAY_MS | 25 | Inter-keystroke delay |
| CU_MOVE_STEPS | 18 | Cursor interpolation steps |

Docs

  • WSL_SETUP.md — Windows-side setup, systemd, linger
  • ARCHITECTURE.md — internals + design rationale
  • CAPTCHA.md — what passive / behavioural / visual challenges this approach can and cannot handle
  • TROUBLESHOOTING.md — common failure modes with fixes
  • FAQ.md — Playwright comparison, anti-bot honesty, parallel runs, profile safety
  • SECURITY.md — threat model and hardening checklist

Security

This is an LLM with hands. Read SECURITY.md. Baseline:

  • Run in an isolated WSL distro, not your daily driver.
  • Strip run_shell if the agent doesn't need shell access.
  • Don't persist real credentials in CU_PROFILE_DIR.

Contributing

See CONTRIBUTING.md. The guiding thesis is that "emit no abnormal signals by default" takes priority over "emit clever evasions". Additive hybrid paths (e.g. opt-in DOM / CDP fast-clicks that users turn on per-site) are still welcome as long as they do not flip the default posture.

License

MIT. See LICENSE.
