npu-vision-fallback
Provides an MCP server for local low-power screen vision, enabling AI agents to perform OCR and UI detection on inaccessible screens (games, remote desktops) using NPU acceleration and system OCR.
<div align="center">
npu-vision-fallback
Local low-power vision for desktop AI agents
When accessibility APIs fail: NPU-first, zero GPU wake-up, 100% local
</div>
What is this?
A lightweight, local-first vision service for desktop agents that need to see and interact with screens where traditional accessibility APIs fall short, such as games, remote desktops, and canvas apps.
Built for efficiency: Native OS OCR · Intel NPU acceleration · Zero cloud calls · Battery-friendly by design
Why Use This?
Desktop agents face a challenge: how to perceive UI when the accessibility tree is empty?
| Common Approach | The Problem |
|---|---|
| Multimodal LLM screenshots | Expensive tokens, slow round-trips, coordinate hallucination |
| OS Accessibility APIs only | Blind to games, canvas apps, remote desktops, emulators |
| Heavy GPU OCR (PaddleOCR) | Big dependencies, high power draw, wakes discrete GPU |
npu-vision-fallback is the fallback layer: when the accessibility tree comes back empty, it gives your agent a small, fast, local vision service that doesn't touch the cloud or spin up the dGPU.
Perfect for:
- Game UIs and emulators
- Remote desktop / VNC clients (no remote accessibility tree)
- Canvas / WASM web apps rendering outside the DOM
- Local SLMs that can't afford multimodal screenshot tokens
Quick Start
1. Install (Windows + Intel NPU recommended)
pip install "npu-vision-fallback[ocr-win,detect]"
python scripts/download_ui_model.py # One-time setup
2. Configure Claude Desktop
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "npu-vision-fallback": {
      "command": "npu-vision-fallback"
    }
  }
}
3. Use it
Restart Claude Desktop and try:
You: The accessibility tree for this game is empty. Can you read the screen at coordinates [0,0,1280,800] and find the "Start Game" button?
Claude: (calls `analyze_screen`) I found a button labeled "Start Game" at [520, 580, 720, 640]. Want me to click its center at (620, 610)?
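The click point in the exchange above is just the midpoint of the returned box. A minimal sketch, assuming boxes are `[x1, y1, x2, y2]` in screen pixels as in the example; `box_center` is a hypothetical helper, not part of the package:

```python
def box_center(box):
    """Return the integer center (x, y) of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


# Box reported for the "Start Game" button in the example conversation.
start_button = [520, 580, 720, 640]
print(box_center(start_button))  # (620, 610)
```

Integer division keeps the result on a whole pixel, which is what most click APIs expect.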
Installation Options
Windows (Recommended)
Native OCR + NPU UI detection (~85 MB total):
pip install "npu-vision-fallback[ocr-win,detect]"
python scripts/download_ui_model.py
Linux / macOS
Cross-platform OCR + CPU detection (~130 MB):
pip install "npu-vision-fallback[ocr-rapid,detect]"
python scripts/download_ui_model.py
Full (All Backends)
For development or testing all backends:
pip install "npu-vision-fallback[all]"
python scripts/download_ui_model.py
Minimal Core
Just the MCP server (no OCR/detection, ~20 MB):
pip install npu-vision-fallback
Note: The `detect` extra uses OpenVINO (~80 MB) at runtime, not PyTorch. Model conversion requires the `dev-convert` extra (~2 GB), but that's a one-time setup most users skip.
Key Features
- NPU-first architecture: UI detection runs on Intel AI Boost at ~80ms per call (~0.3J energy)
- Zero dGPU wake-up: Default paths use the NPU, system OCR, or CPU, so the laptop battery stays happy
- Native OS OCR: Uses the Windows OCR engine (macOS Vision planned) for recognition quality
- MCP protocol: Works with Claude Desktop, Cursor, or any MCP client out of the box
- Lightweight: No PyTorch/TensorFlow at runtime; all heavy deps are optional
- Privacy-first: 100% local processing, no telemetry, no cloud
Performance
Measured on Intel Core Ultra 9 275HX (2560×1600 screen, on battery):
| Task | Backend | Latency | Energy | Notes |
|---|---|---|---|---|
| OCR | WinOCR | ~1100ms | 2.5J | Native Windows API (full screen) |
| OCR | RapidOCR | ~6300ms | 14.5J | Cross-platform ONNX CPU |
| UI Detection | OpenVINO NPU | ~80ms | 0.3J | YOLOv8n on Intel AI Boost |
| UI Detection | OpenVINO CPU | ~120ms | n/a | Fallback when no NPU |
Full benchmark details and reproduction steps: outputs/power_report.md
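To compare backends on one scale, energy divided by latency gives a rough average power per call. The sketch below only re-derives numbers from the table above; those figures are approximate, and whether the energy values include idle draw is not stated here, so treat the results as ballpark.

```python
# (latency in seconds, energy in joules), copied from the performance table.
measurements = {
    "WinOCR": (1.100, 2.5),
    "RapidOCR": (6.300, 14.5),
    "OpenVINO NPU": (0.080, 0.3),
}


def average_power_watts(latency_s, energy_j):
    """Average power over one call: P = E / t."""
    return energy_j / latency_s


for name, (latency_s, energy_j) in measurements.items():
    print(f"{name}: {average_power_watts(latency_s, energy_j):.2f} W")

# Energy cost of one full-screen OCR pass, RapidOCR relative to WinOCR.
ratio = measurements["RapidOCR"][1] / measurements["WinOCR"][1]
print(f"RapidOCR uses {ratio:.1f}x the energy of WinOCR per call")
```

Average power is similar for both OCR backends, so the energy gap comes almost entirely from RapidOCR's longer runtime.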
MCP Tools
| Tool | Purpose | Key Arguments |
|---|---|---|
| `health_check` | Server status | none |
| `list_backends` | Available backends | none |
| `ocr_region` | Extract text from region | region=[x1,y1,x2,y2] |
| `detect_ui` | Find UI elements | region=[x1,y1,x2,y2] |
| `analyze_screen` | Combined OCR + detection | region=[x1,y1,x2,y2] |
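MCP clients invoke these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for a region-based tool. The `region` argument name comes from the table above; the full argument schema is an assumption, and `build_tool_call` is a hypothetical helper, not part of the package.

```python
import json


def build_tool_call(tool, region, request_id=1):
    """Build an MCP `tools/call` JSON-RPC 2.0 request for a region-based tool."""
    x1, y1, x2, y2 = region
    if x2 <= x1 or y2 <= y1:
        raise ValueError("region must satisfy x1 < x2 and y1 < y2")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # Assumed schema: a single `region` argument, per the tools table.
        "params": {"name": tool, "arguments": {"region": [x1, y1, x2, y2]}},
    }


request = build_tool_call("analyze_screen", [0, 0, 1280, 800])
print(json.dumps(request, indent=2))
```

In practice Claude Desktop or another MCP client builds this message for you; the sketch is only meant to show what travels over the wire.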
`analyze_screen` is the primary tool: it fuses detection and OCR and returns spatially sorted elements with text annotations, which suits agent navigation.
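One way to picture the fusion step: attach each OCR word to the detected element whose box contains the word's center, then sort elements in reading order. This is an illustrative sketch only. The `Element` type, the `fuse` function, the `row_height` bucketing, and the sample data are all assumptions, not the project's actual implementation or output format.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str       # detector label, e.g. "button" (label set is hypothetical)
    box: tuple      # (x1, y1, x2, y2) in screen pixels
    text: str = ""  # OCR text attached during fusion


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def contains(box, point):
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def fuse(detections, ocr_words, row_height=40):
    """Attach OCR words to the element containing each word's center,
    then sort top-to-bottom (in row buckets) and left-to-right."""
    elements = [Element(kind, box) for kind, box in detections]
    for text, word_box in ocr_words:
        for el in elements:
            if contains(el.box, center(word_box)):
                el.text = (el.text + " " + text).strip()
                break
    return sorted(elements, key=lambda el: (el.box[1] // row_height, el.box[0]))


detections = [("button", (520, 580, 720, 640)), ("button", (40, 20, 200, 60))]
ocr_words = [
    ("Start", (560, 595, 620, 625)),
    ("Game", (630, 595, 690, 625)),
    ("Menu", (90, 30, 150, 50)),
]
for el in fuse(detections, ocr_words):
    print(el.kind, el.box, repr(el.text))
```

Matching on the word's center rather than requiring full containment tolerates OCR boxes that spill slightly past the detected element.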
Documentation
- Architecture Guide: System design and data flow
- Backend Reference: Per-backend capabilities and priorities
- FAQ: Common questions and troubleshooting
- Contributing: How to contribute
- Code Guide: Project constitution for contributors
Examples
| Example | Description |
|---|---|
| `basic_ocr.py` | Simple OCR call on a screen region |
| `agent_ui_navigation.py` | Find and click UI elements |
| `desktop_remote_vnc.py` | Vision fallback in a remote desktop |
uv run python examples/basic_ocr.py --region 0 0 1280 800
Roadmap
- v1.1: Multi-monitor support, DPI scaling awareness
- v2.0: Custom model training interface, bring your own detector
- v2.1: UI-TARS integration, macOS Vision backend, PP-OCR v4 on NPU
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines. Please also read CLAUDE.md, the project constitution that keeps code quality and architecture consistent.
Supported Backends
| Backend | Type | Device | Platform | Status |
|---|---|---|---|---|
| `winocr` | System OCR | CPU/NPU | Windows | Primary |
| `openvino_npu` | UI Detection | NPU | Win/Linux + Intel NPU | Primary |
| `openvino_cpu` | UI Detection | CPU | Win/Linux/macOS | Fallback |
| `rapid_ocr` | OCR | CPU | All | Cross-platform |
| `pytesseract` | OCR | CPU | All | Last-resort |
| `vision` | System OCR | ANE | macOS | Planned |
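The Status column suggests a simple priority chain: try the primary backend first and fall through to whatever is installed. A minimal sketch of that idea follows. The ordering is inferred from the table, and `pick_backend` is a hypothetical helper; the server's real selection logic may differ (see the Backend Reference).

```python
# Priority order inferred from the Status column (an assumption).
OCR_PRIORITY = ["winocr", "rapid_ocr", "pytesseract"]
DETECT_PRIORITY = ["openvino_npu", "openvino_cpu"]


def pick_backend(priority, available):
    """Return the first backend in priority order that is available, else None."""
    for name in priority:
        if name in available:
            return name
    return None


# Hypothetical Linux machine without an Intel NPU.
available = {"rapid_ocr", "pytesseract", "openvino_cpu"}
print(pick_backend(OCR_PRIORITY, available))     # rapid_ocr
print(pick_backend(DETECT_PRIORITY, available))  # openvino_cpu
```

The `list_backends` tool is the place to check what is actually available on a given machine.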
License
MIT © npu-vision-fallback contributors
Acknowledgments
Built with:
- Model Context Protocol (Anthropic): Agent integration layer
- OpenVINO: NPU/CPU inference runtime
- Ultralytics YOLO: UI detection models
- RapidOCR: Cross-platform OCR engine
- Tesseract: OCR fallback
- python-mss: Screen capture library
Development assisted by Claude Code (Anthropic). Architecture design and code review powered by AI collaboration.