Screen Agent
A Windows desktop automation MCP server that enables UI recognition through OCR, UIA controls, and multi-point color matching. It allows agents to interact with desktop applications via actions like clicking and typing while using a learning system to track and improve operation success.
A Windows desktop automation MCP server that supports UI recognition via OCR, UIA controls, and multi-point color matching.
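Features
- Multiple UI recognition methods
  - OCR text recognition (RapidOCR)
  - Windows UIA control recognition
  - Multi-point color feature matching
- Smart operations
  - Window binding and automatic focus
  - Pop-up detection and handling
  - Operation verification and error recovery
- Learning and evolution
  - Skill learning system
  - Operation success-rate tracking
  - Experience stored in a vector database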
Installation
Requirements
- Windows 10/11
- Python 3.12+
- Ollama (optional, for visual recognition)
Installation steps
# Clone the repository
git clone https://github.com/lqszhsp/screen-agent.git
cd screen-agent
# Create a virtual environment
python -m venv venv
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Configuration
- Copy the configuration template:
copy config\settings.example.py config\settings.py
- Edit config/settings.py to set your API key (only needed if you use a cloud vision API)
Usage
As an MCP server
Configure it in Claude Desktop or another MCP client:
{
"mcpServers": {
"screen-agent": {
"command": "python",
"args": ["C:\\path\\to\\screen_agent\\mcp_server.py"]
}
}
}
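Hand-escaping backslashes in Windows paths is an easy place to make a mistake. The following is a minimal sketch, not part of the project, that builds the same entry in Python and lets `json.dumps` handle the escaping. The function name `build_mcp_config` is hypothetical, and the script path is the same placeholder used above.

```python
import json
from pathlib import PureWindowsPath


def build_mcp_config(server_script: str, python_cmd: str = "python") -> dict:
    """Return an mcpServers entry shaped like the JSON shown above.

    `build_mcp_config` is an illustrative helper, not part of screen-agent.
    """
    # Normalise the path to Windows separators; json.dumps escapes them later.
    script = str(PureWindowsPath(server_script))
    return {
        "mcpServers": {
            "screen-agent": {
                "command": python_cmd,
                "args": [script],
            }
        }
    }


if __name__ == "__main__":
    # Placeholder path: replace with the real location of mcp_server.py.
    config = build_mcp_config(r"C:\path\to\screen_agent\mcp_server.py")
    print(json.dumps(config, indent=2))
```

If the client runs a different interpreter than the virtual environment created during installation, passing the full path to `venv\Scripts\python.exe` as `python_cmd` is one way to make sure the installed dependencies are found.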
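Available tools
| Tool | Description |
|---|---|
| screen_get_layout | Bind a window and get its layout information |
| screen_click | Click a screen element |
| screen_input_text | Type text |
| screen_scroll | Scroll the screen |
| screen_hotkey | Press a keyboard shortcut |
| screen_capture | Take a screenshot and recognize elements |
| screen_wait | Wait for a specified duration |
| screen_explore | Automatically explore the interface |
| screen_detect_ui | Detect the positions of UI elements |
| screen_scan_ui_elements | Scan UI elements and generate icon features |
| screen_ask_user_locate | Ask the user to help locate an element |
| screen_learn_success | Record a successful operation |
| screen_query_knowledge | Query learned knowledge |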
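MCP clients invoke tools like these through JSON-RPC 2.0 `tools/call` requests. The sketch below shows what such a request could look like for `screen_click`. The helper `make_tool_call` is hypothetical, and the argument names (`target`, `mode`) are taken from the click examples in this README; the full input schema of each tool is not documented here, so treat the arguments as assumptions.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC only requires that they be unique per session.
_request_ids = count(1)


def make_tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server.

    Illustrative helper only; real clients (for example Claude Desktop)
    construct this message themselves.
    """
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    # Argument names are assumed from the README's screen_click examples.
    request = make_tool_call("screen_click", {"target": "设置", "mode": "ocr"})
    print(json.dumps(request, ensure_ascii=False, indent=2))
```

In practice the client sends this message for you; the sketch is only meant to show how a tool name from the table maps onto a request.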
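Click modes
# OCR mode (default) - locate by on-screen text ("设置" is the label "Settings")
screen_click(target="设置", mode="ocr")
# UIA mode - locate by UI control ("确定" is the label "OK")
screen_click(target="确定", mode="ui", control_type="Button")
# Multi-point color mode - locate by color features
screen_click(mode="multipoint", features={"0|0": "#07c160", "10|10": "#ffffff"})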
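The README does not describe how multi-point matching works internally. One plausible reading, sketched below, is that each `features` key is an `"dx|dy"` pixel offset from an anchor point and each value is the color expected at that offset. This interpretation, the `tolerance` parameter, and all function names are assumptions for illustration, not the project's implementation. The image is modelled as a plain list of rows of RGB tuples so the example has no dependencies.

```python
from typing import Dict, List, Optional, Tuple

RGB = Tuple[int, int, int]


def parse_hex(color: str) -> RGB:
    """Convert '#rrggbb' into an (r, g, b) tuple."""
    color = color.lstrip("#")
    return (int(color[0:2], 16), int(color[2:4], 16), int(color[4:6], 16))


def parse_features(features: Dict[str, str]) -> List[Tuple[int, int, RGB]]:
    """Turn {'dx|dy': '#rrggbb'} into a list of (dx, dy, rgb) triples."""
    parsed = []
    for key, color in features.items():
        dx, dy = (int(part) for part in key.split("|"))
        parsed.append((dx, dy, parse_hex(color)))
    return parsed


def close_enough(a: RGB, b: RGB, tolerance: int) -> bool:
    """True if every channel differs by at most `tolerance`."""
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


def find_multipoint(
    pixels: List[List[RGB]],
    features: Dict[str, str],
    tolerance: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return the first (x, y) anchor where every feature point matches."""
    points = parse_features(features)
    height = len(pixels)
    width = len(pixels[0]) if height else 0

    def point_matches(x: int, y: int, dx: int, dy: int, color: RGB) -> bool:
        px, py = x + dx, y + dy
        # Offsets that fall outside the image can never match.
        if not (0 <= px < width and 0 <= py < height):
            return False
        return close_enough(pixels[py][px], color, tolerance)

    # Scan row by row, left to right, and stop at the first full match.
    for y in range(height):
        for x in range(width):
            if all(point_matches(x, y, dx, dy, c) for dx, dy, c in points):
                return (x, y)
    return None
```

Checking several points instead of one makes a match far less likely to fire on an unrelated pixel that happens to share a color, which is presumably why the mode takes a feature set rather than a single color. A brute-force scan like this is slow on full screenshots; a real implementation would likely restrict the search region or use vectorized array operations.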
Project structure
screen_agent/
├── mcp_server.py # MCP server entry point
├── actions/ # Action modules
│ ├── click.py # Click action
│ ├── input_text.py # Text input
│ ├── scroll.py # Scroll action
│ └── ...
├── core/ # Core modules
│ ├── perception.py # OCR perception
│ ├── window_manager.py # Window management
│ ├── evolution.py # Evolution mechanism
│ └── ...
├── app_layouts/ # Application layout files
│ ├── _guidelines.md # Operation guidelines
│ ├── _template.md # Layout template
│ ├── 微信.md # WeChat layout
│ └── ...
└── config/ # Configuration files
└── settings.py
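Layout files
Each application can have its own layout file (app_layouts/{application name}.md) containing:
- Window structure and region definitions
- Positions of commonly used elements
- Operating rules and restrictions
- A list of keyboard shortcuts
Use app_layouts/_template.md as a reference when creating a new layout file.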
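Starting a new layout file amounts to copying the template under the application's name. The helper below is a small sketch of that step; `create_layout` is a hypothetical function, not something the project ships, and it assumes the directory layout shown in the project structure.

```python
import shutil
from pathlib import Path


def create_layout(app_name: str, layouts_dir: Path = Path("app_layouts")) -> Path:
    """Copy _template.md to app_layouts/{app_name}.md and return the new path.

    Illustrative helper; refuses to overwrite an existing layout file.
    """
    template = layouts_dir / "_template.md"
    target = layouts_dir / f"{app_name}.md"
    if target.exists():
        raise FileExistsError(f"layout already exists: {target}")
    shutil.copyfile(template, target)
    return target


if __name__ == "__main__":
    # Run from the repository root; "notepad" is only an example name.
    print(create_layout("notepad"))
```

The copied file still has to be filled in by hand with the window regions, element positions, rules, and shortcuts listed above.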
Technical documentation
License
MIT License