npu-vision-fallback

Provides an MCP server for local low-power screen vision, enabling AI agents to perform OCR and UI detection on inaccessible screens (games, remote desktops) using NPU acceleration and system OCR.

<div align="center">

🔋 npu-vision-fallback

Local low-power vision for desktop AI agents

When accessibility APIs fail: NPU-first, zero GPU wake-up, 100% local

CI · PyPI · License: MIT · Python 3.11+

English | 中文文档 (Chinese documentation)

</div>


English

What is this?

A lightweight, local-first vision service for desktop agents that need to see and interact with screens where traditional accessibility APIs fall short: games, remote desktops, canvas apps, and more.

Built for efficiency: Native OS OCR · Intel NPU acceleration · Zero cloud calls · Battery-friendly by design

<div align="center">

Architecture Diagram

</div>


✨ Why Use This?

Desktop agents face a challenge: how to perceive UI when the accessibility tree is empty?

| Common Approach | The Problem |
| --- | --- |
| 🤖 Multimodal LLM screenshots | Expensive tokens, slow round-trips, coordinate hallucination |
| 🌳 OS Accessibility APIs only | Blind to games, canvas apps, remote desktops, emulators |
| 🔥 Heavy GPU OCR (PaddleOCR) | Big dependencies, high power draw, wakes discrete GPU |

npu-vision-fallback is your fallback layer: when the accessibility tree comes back empty, it gives your agent a small, fast, local vision service that doesn't touch the cloud or spin up the dGPU.

Perfect for:

  • 🎮 Game UIs and emulators
  • 🖥️ Remote desktop / VNC clients (no remote accessibility tree)
  • 🎨 Canvas / WASM web apps rendering outside the DOM
  • 💻 Local SLMs that can't afford multimodal screenshot tokens
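
The fallback idea described above can be sketched in a few lines. This is only an illustration of the control flow on the agent side, not code from this project: `perceive`, `empty_tree`, and `fake_vision` are hypothetical names, and the element dictionaries are an assumed shape.

```python
def perceive(read_accessibility_tree, analyze_screen, region):
    """Prefer the accessibility tree; use vision only when it comes back empty."""
    nodes = read_accessibility_tree()
    if nodes:
        return {"source": "accessibility", "elements": nodes}
    return {"source": "vision", "elements": analyze_screen(region)}


# Stand-ins for illustration only: a game window that exposes no accessibility nodes.
def empty_tree():
    return []


def fake_vision(region):
    # A real agent would call the MCP server's analyze_screen tool here.
    return [{"text": "Start Game", "box": [520, 580, 720, 640]}]


result = perceive(empty_tree, fake_vision, [0, 0, 1280, 800])
print(result["source"])  # vision
```

Keeping the two perception paths behind plain callables means the vision service is consulted only when the cheaper accessibility query returns nothing.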

🚀 Quick Start

1. Install (Windows + Intel NPU recommended)

pip install "npu-vision-fallback[ocr-win,detect]"
python scripts/download_ui_model.py  # One-time setup

2. Configure Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "npu-vision-fallback": {
      "command": "npu-vision-fallback"
    }
  }
}
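
If the config file already lists other servers, the entry has to be merged rather than pasted over them. The helper below is a minimal sketch of that merge using only the standard library; `add_mcp_server` is a hypothetical name, and the path to `claude_desktop_config.json` is left to the caller because it differs by operating system.

```python
import json
from pathlib import Path


def add_mcp_server(config_path, name="npu-vision-fallback", command="npu-vision-fallback"):
    """Merge one server entry into an MCP client config, keeping existing entries."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    # Create the "mcpServers" object if it is missing, then add or overwrite our entry.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it as `add_mcp_server("/path/to/claude_desktop_config.json")` with the location used by your installation, then restart the client.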

3. Use it

Restart Claude Desktop and try:

You: The accessibility tree for this game is empty. Can you read the screen at coordinates [0,0,1280,800] and find the "Start Game" button?

Claude: (calls analyze_screen) I found a button labeled "Start Game" at [520, 580, 720, 640]. Want me to click its center at (620, 610)?
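
The click point in the exchange above is simply the midpoint of the returned box. A small sketch, assuming boxes use the `[x1, y1, x2, y2]` layout shown in the conversation (`box_center` is a hypothetical helper, not part of the server):

```python
def box_center(box):
    """Return the integer center (x, y) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    if x2 < x1 or y2 < y1:
        raise ValueError(f"malformed box: {box!r}")
    return ((x1 + x2) // 2, (y1 + y2) // 2)


print(box_center([520, 580, 720, 640]))  # (620, 610)
```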


📦 Installation Options

Windows (Recommended)

Native OCR + NPU UI detection (~85 MB total):

pip install "npu-vision-fallback[ocr-win,detect]"
python scripts/download_ui_model.py

Linux / macOS

Cross-platform OCR + CPU detection (~130 MB):

pip install "npu-vision-fallback[ocr-rapid,detect]"
python scripts/download_ui_model.py

Full (All Backends)

For development or testing all backends:

pip install "npu-vision-fallback[all]"
python scripts/download_ui_model.py

Minimal Core

Just the MCP server (no OCR/detection, ~20 MB):

pip install npu-vision-fallback

💡 Note: The detect extra uses OpenVINO (~80 MB) at runtime, not PyTorch. Model conversion requires the dev-convert extra (~2 GB), but that is a one-time step most users skip.


🎯 Key Features

  • 🔋 NPU-first architecture: UI detection runs on Intel AI Boost at ~80ms per call (~0.3J energy)
  • ⚡ Zero dGPU wake-up: default paths use the NPU, system OCR, or CPU, so the laptop battery stays happy
  • Native OS OCR: uses the Windows OCR engine (macOS Vision planned) for quality
  • 🧩 MCP protocol: works with Claude Desktop, Cursor, or any MCP client out of the box
  • 🪶 Lightweight: no PyTorch/TensorFlow at runtime; all heavy deps are optional
  • 🛡️ Privacy-first: 100% local processing, no telemetry, no cloud

⚡ Performance

Measured on Intel Core Ultra 9 275HX (2560×1600 screen, on battery):

| Task | Backend | Latency | Energy | Notes |
| --- | --- | --- | --- | --- |
| OCR | WinOCR | ~1100ms | 2.5J | Native Windows API (full screen) |
| OCR | RapidOCR | ~6300ms | 14.5J | Cross-platform ONNX CPU |
| UI Detection | OpenVINO NPU | ~80ms | 0.3J | YOLOv8n on Intel AI Boost |
| UI Detection | OpenVINO CPU | ~120ms | - | Fallback when no NPU |

Full benchmark details and reproduction steps: outputs/power_report.md
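
To put the table in battery terms, the arithmetic below derives two rough figures from the published numbers: the average power implied by energy divided by latency, and how many calls fit in one watt-hour (3600 J). This is back-of-the-envelope only. The table does not say whether the energy values are whole-package or incremental, so treat the derived watts as an assumption-laden estimate.

```python
# (latency in seconds, energy in joules), copied from the table above.
ROWS = {
    "WinOCR": (1.100, 2.5),
    "RapidOCR": (6.300, 14.5),
    "OpenVINO NPU": (0.080, 0.3),
}

JOULES_PER_WH = 3600.0


def average_power_w(latency_s, energy_j):
    """Average power over one call: P = E / t."""
    return energy_j / latency_s


def calls_per_wh(energy_j):
    """How many calls one watt-hour of battery would cover."""
    return JOULES_PER_WH / energy_j


for backend, (latency_s, energy_j) in ROWS.items():
    watts = average_power_w(latency_s, energy_j)
    calls = calls_per_wh(energy_j)
    print(f"{backend}: ~{watts:.2f} W average, ~{calls:.0f} calls per Wh")
```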


🛠️ MCP Tools

| Tool | Purpose | Key Arguments |
| --- | --- | --- |
| health_check | Server status | - |
| list_backends | Available backends | - |
| ocr_region | Extract text from region | region=[x1,y1,x2,y2] |
| detect_ui | Find UI elements | region=[x1,y1,x2,y2] |
| analyze_screen 🌟 | Combined OCR + detection | region=[x1,y1,x2,y2] |
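
For readers curious what a client sends on the wire, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a request for the tools listed above. `tool_call_request` is a hypothetical helper; the region check is an assumption about what counts as a sensible box, and in practice an MCP client library also performs the initialization handshake before any tool call.

```python
import json


def tool_call_request(request_id, tool, region=None):
    """Build an MCP `tools/call` JSON-RPC 2.0 request for one of the tools above."""
    arguments = {}
    if region is not None:
        x1, y1, x2, y2 = region
        # Assumed sanity check: a region must have positive width and height.
        if not (x1 < x2 and y1 < y2):
            raise ValueError(f"empty or inverted region: {region!r}")
        arguments["region"] = [x1, y1, x2, y2]
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


print(json.dumps(tool_call_request(1, "analyze_screen", [0, 0, 1280, 800]), indent=2))
```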

analyze_screen is the primary tool: it fuses detection and OCR and returns spatially sorted elements with text annotations, which suits agent navigation.
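
One way to picture that fusion step: attach each OCR word to the detected element that contains the word's center, then sort elements top to bottom and left to right. The code below is a simplified sketch of that idea, not the server's actual implementation; the field names `kind`, `box`, and `text` are assumptions, since the real output schema is not specified here.

```python
def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def contains(box, point):
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def fuse(elements, words):
    """Attach OCR words to the first element containing their center, then sort."""
    fused = [{"kind": e["kind"], "box": e["box"], "text": []} for e in elements]
    for word in words:
        point = center(word["box"])
        for item in fused:
            if contains(item["box"], point):
                item["text"].append(word["text"])
                break  # each word belongs to at most one element
    for item in fused:
        item["text"] = " ".join(item["text"])
    # Reading order: top edge first, then left edge.
    return sorted(fused, key=lambda item: (item["box"][1], item["box"][0]))


elements = [
    {"kind": "button", "box": [520, 580, 720, 640]},
    {"kind": "label", "box": [100, 40, 400, 80]},
]
words = [
    {"text": "Start", "box": [540, 595, 610, 625]},
    {"text": "Game", "box": [620, 595, 700, 625]},
    {"text": "Main", "box": [110, 45, 180, 75]},
    {"text": "Menu", "box": [190, 45, 260, 75]},
]
for item in fuse(elements, words):
    print(item["kind"], item["box"], repr(item["text"]))
```

Words that fall outside every detected element are dropped in this sketch; a fuller version might return them as standalone text elements.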


📚 Documentation


🧪 Examples

| Example | Description |
| --- | --- |
| basic_ocr.py | Simple OCR call on a screen region |
| agent_ui_navigation.py | Find and click UI elements |
| desktop_remote_vnc.py | Vision fallback in a remote desktop |

uv run python examples/basic_ocr.py --region 0 0 1280 800

🗺️ Roadmap

  • v1.1: Multi-monitor support, DPI scaling awareness
  • v2.0: Custom model training interface, bring your own detector
  • v2.1: UI-TARS integration, macOS Vision backend, PP-OCR v4 on NPU

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines. Please also read CLAUDE.md, the project constitution that ensures code quality and architectural consistency.


📋 Supported Backends

| Backend | Type | Device | Platform | Status |
| --- | --- | --- | --- | --- |
| winocr | System OCR | CPU/NPU | Windows | ✅ Primary |
| openvino_npu | UI Detection | NPU | Win/Linux + Intel NPU | ✅ Primary |
| openvino_cpu | UI Detection | CPU | Win/Linux/macOS | ✅ Fallback |
| rapid_ocr | OCR | CPU | All | ✅ Cross-platform |
| pytesseract | OCR | CPU | All | ✅ Last-resort |
| vision | System OCR | ANE | macOS | 🚧 Planned |
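
The Status column implies a preference order: primary backends first, then fallbacks. A minimal sketch of choosing a backend from that order is shown below. The preference lists are read off the table, while `pick_backend` and the idea of passing in an `available` set (for example, taken from the `list_backends` tool) are assumptions rather than the server's documented selection logic.

```python
# Preference orders inferred from the Status column of the table above.
OCR_PREFERENCE = ["winocr", "rapid_ocr", "pytesseract"]
DETECT_PREFERENCE = ["openvino_npu", "openvino_cpu"]


def pick_backend(preference, available):
    """Return the first preferred backend that is available, or None."""
    for name in preference:
        if name in available:
            return name
    return None


# Example: a Linux machine without an Intel NPU.
available = {"rapid_ocr", "pytesseract", "openvino_cpu"}
print(pick_backend(OCR_PREFERENCE, available))     # rapid_ocr
print(pick_backend(DETECT_PREFERENCE, available))  # openvino_cpu
```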

📄 License

MIT © npu-vision-fallback contributors


🙏 Acknowledgments

Built with:

Development assisted by Claude Code (Anthropic). Architecture design and code review powered by AI collaboration.
