vision-mcp

Enables AI-powered vision analysis using local Ollama models. Supports screenshot analysis, OCR, text detection, and health monitoring via the Model Context Protocol (MCP).

Vision MCP Server 🔍

A powerful Model Context Protocol (MCP) server that provides AI-powered vision analysis capabilities using local Ollama models. Analyze screenshots, extract text, detect UI elements, and debug applications with state-of-the-art vision language models.

License: MIT Node.js Ollama

🚀 Features

  • 🔍 Vision Analysis - Analyze screenshots and describe UI state, errors, or issues
  • 📝 OCR Extraction - Extract text from images using VLM or Tesseract
  • 🎯 Text Detection - Find specific text with bounding boxes for automation
  • 🏥 Health Monitoring - Check Ollama connection and available models
  • 🔌 Universal Integration - Works with Electron, Selenium, Playwright, and more
  • 🏃‍♂️ CLI Tool - Standalone command-line interface for any workflow
  • ⚡ High Performance - Optimized for 16GB+ VRAM with local models

📦 Installation

Prerequisites

  • Node.js 18+
  • Ollama (for vision models)
  • Claude Code (for MCP integration)
  • 16GB+ VRAM recommended for optimal performance

Quick Setup

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/vision-mcp.git
cd vision-mcp

# 2. Install dependencies
npm install

# 3. Install Ollama and models
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull llava:7b

# 4. Run setup script
chmod +x setup.sh
./setup.sh

# 5. Test installation
./vlm.mjs health

Claude Code Integration

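Add the following to the mcpServers section of your ~/.claude.json, replacing the command path with the location of vision-mcp-wrapper.sh on your machine: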

{
  "mcpServers": {
    "vision": {
      "type": "stdio",
      "command": "/home/ice/vision-mcp-wrapper.sh",
      "args": [],
      "env": {}
    }
  }
}

🎯 Quick Start

CLI Usage

# Health check
vlm health

# Analyze screenshot
vlm describe --image screenshot.png --prompt "What errors are visible?"

# Extract text (OCR)
vlm ocr --image document.png --engine tesseract

# Find UI elements
vlm find --image app.png --query "Submit button"

MCP Usage in Claude

Use vision.describe to analyze this screenshot for errors
Use vision.find_text to locate the "Login" button  
Use vision.ocr to extract all visible text
Use vision.health to check model status

🛠️ Available Tools

vision.describe

Analyze images and describe UI state, errors, or issues.

Parameters:

  • image_b64 (string): Base64-encoded image
  • prompt (string, optional): Custom analysis prompt
  • model (string, optional): Ollama model to use
  • max_tokens (number, optional): Maximum response tokens

Example:

{
  "image_b64": "iVBORw0KGgoAAAANSUhEU...",
  "prompt": "Identify any error messages or broken UI elements",
  "model": "llava:7b"
}
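
The image_b64 value is simply the raw image bytes encoded as base64. A minimal sketch of building the argument object in Node.js follows; the helper name buildDescribeArgs and its validation are assumptions for illustration, and only the parameter names come from the list above.

```javascript
// Hypothetical helper (not part of vision-mcp): builds arguments for vision.describe.
// Parameter names image_b64, prompt, model, and max_tokens are taken from the README.
function buildDescribeArgs(imageBytes, options = {}) {
  if (!Buffer.isBuffer(imageBytes) || imageBytes.length === 0) {
    throw new TypeError("imageBytes must be a non-empty Buffer");
  }
  const args = { image_b64: imageBytes.toString("base64") };
  // Copy only the optional fields the caller actually set.
  for (const key of ["prompt", "model", "max_tokens"]) {
    if (options[key] !== undefined) args[key] = options[key];
  }
  return args;
}

// Usage: the 8-byte PNG signature stands in for a real screenshot here.
const pngSignature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const describeArgs = buildDescribeArgs(pngSignature, {
  prompt: "What errors are visible?",
});
console.log(describeArgs.image_b64); // iVBORw0KGgo=
```

In practice the Buffer would come from reading a screenshot file or from a browser automation API, which is why every base64-encoded PNG starts with the same iVBORw0KGgo prefix seen in the example above.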

vision.ocr

Extract text from images using VLM or Tesseract.

Parameters:

  • image_b64 (string): Base64-encoded image
  • engine (string, optional): "vlm" or "tesseract"
  • model (string, optional): Model for VLM OCR
  • structured (boolean, optional): Return structured JSON

vision.find_text

Locate specific text and return bounding boxes.

Parameters:

  • image_b64 (string): Base64-encoded image
  • query (string): Text to search for
  • model (string, optional): Ollama model to use
  • fuzzy (boolean, optional): Allow fuzzy matching
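
The README does not say how the fuzzy option is implemented. As an illustration only, one common approach is to accept candidates within a small edit distance of the query, which tolerates typical OCR slips such as "Log1n" for "Login". The function names and the maxEdits threshold below are assumptions, not vision-mcp's actual behavior.

```javascript
// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical matcher: exact mode requires case-insensitive equality,
// fuzzy mode tolerates up to maxEdits character edits.
function matchesQuery(candidate, query, { fuzzy = false, maxEdits = 2 } = {}) {
  const a = candidate.trim().toLowerCase();
  const b = query.trim().toLowerCase();
  if (!fuzzy) return a === b;
  return levenshtein(a, b) <= maxEdits;
}

console.log(matchesQuery("Log1n", "Login"));                  // false
console.log(matchesQuery("Log1n", "Login", { fuzzy: true })); // true
```

A fixed edit budget is the simplest choice; scaling it with the query length would be a reasonable refinement for longer strings.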

vision.health

Check Ollama connection and available models.
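
A health check of this kind can be approximated by querying Ollama's GET /api/tags endpoint, which lists locally installed models (the same endpoint used with curl in the troubleshooting section). The sketch below is an assumption about how such a probe might look; the function name, the injectable fetchImpl parameter, and the return shape are illustrative and not taken from vision-mcp.

```javascript
// Hypothetical health probe against Ollama's model-listing endpoint.
// fetchImpl is injectable so the function can be exercised without a running server.
async function checkOllamaHealth(baseUrl = "http://127.0.0.1:11434", fetchImpl = fetch) {
  try {
    const response = await fetchImpl(`${baseUrl}/api/tags`);
    if (!response.ok) {
      return { ok: false, models: [], error: `HTTP ${response.status}` };
    }
    const data = await response.json();
    // Ollama responds with { models: [{ name, ... }, ...] }.
    const models = (data.models ?? []).map((m) => m.name);
    return { ok: true, models };
  } catch (err) {
    // Connection refused, DNS failure, invalid JSON, and similar errors land here.
    return { ok: false, models: [], error: err.message };
  }
}

// Usage (requires Node.js 18+ for the global fetch):
// checkOllamaHealth().then((status) => console.log(status));
```

Returning a status object instead of throwing keeps the caller's logic simple: it can report "Ollama not responding" and "model not installed" from the same result.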

🔗 Integration Examples

Electron CDP Integration

// Add to your Electron MCP server
server.registerTool(
  "browser_vision_check",
  {
    title: "Analyze page with AI vision",
    inputSchema: z.object({
      prompt: z.string().default("Check for errors"),
      fullPage: z.boolean().default(false)
    })
  },
  async ({ prompt, fullPage }) => {
    const page = await pickPage();
    const screenshot = await page.screenshot({ 
      type: "png", 
      encoding: "base64", 
      fullPage 
    });
    
    // Return a hint so that Claude calls vision.describe next
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          next_tool: "vision.describe",
          args: { image_b64: screenshot, prompt }
        })
      }]
    };
  }
);

Selenium Integration

// Shell out to the vlm CLI from a Selenium MCP server
const fs = require("fs/promises");
const { execFile } = require("child_process");

async function analyzeSeleniumPage(driver, prompt) {
  const screenshot = await driver.takeScreenshot(); // base64-encoded PNG
  const tmpPath = `/tmp/selenium-${Date.now()}.png`;

  await fs.writeFile(tmpPath, Buffer.from(screenshot, "base64"));

  return new Promise((resolve, reject) => {
    // Adjust this path to wherever vlm.mjs lives on your machine
    execFile("/home/ice/vision-mcp/vlm.mjs",
      ["describe", "--image", tmpPath, "--prompt", prompt],
      (error, stdout) => {
        fs.unlink(tmpPath).catch(() => {}); // Best-effort cleanup
        if (error) return reject(error);
        try {
          resolve(JSON.parse(stdout));
        } catch (parseError) {
          reject(parseError); // CLI output was not valid JSON
        }
      }
    );
  });
}

Playwright in Docker

# docker-compose.yml
services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.54.2-noble
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_HOST=host.docker.internal

// Direct Ollama call from container
async function analyzeWithVision(page, prompt) {
  // Playwright's page.screenshot() returns a Buffer, so encode it to base64 here
  const screenshot = (await page.screenshot({ type: "png" })).toString("base64");
  
  const response = await fetch("http://host.docker.internal:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava:7b",
      prompt: `Analyze this UI screenshot: ${prompt}`,
      images: [screenshot],
      stream: false
    })
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: HTTP ${response.status}`);
  }
  
  const data = await response.json();
  return data.response;
}

🤖 Supported Models

Recommended Models by Use Case

| Use Case | Model | Size | Strengths |
|---|---|---|---|
| General UI Analysis | llava:7b ✅ | 4.1GB | Reliable, fast, good reasoning |
| OCR & Text Extraction | minicpm-v:8b-2.6 | ~8GB | State-of-the-art OCR accuracy |
| Document Analysis | qwen2.5vl:7b | ~8GB | Excellent for complex layouts |
| Lightweight | llava:7b | 4.1GB | Best speed/accuracy balance |

Model Installation

# Current default (installed)
ollama pull llava:7b

# Better OCR model
ollama pull minicpm-v:8b-2.6

# Best document analysis (when available)
ollama pull qwen2.5vl:7b

⚙️ Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1 | Ollama server host |
| OLLAMA_PORT | 11434 | Ollama server port |
| VISION_MODEL | llava:7b | Default vision model |
| OCR_MODEL | llava:7b | Default OCR model |
| MAX_TOKENS | 1024 | Maximum response tokens |
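
As a rough illustration of how these variables and their defaults could be resolved in Node.js, consider the sketch below. The function name loadVisionConfig, the returned field names, and the validation rules are assumptions; only the variable names and default values come from the table above.

```javascript
// Hypothetical config loader; defaults mirror the documented environment variables.
function loadVisionConfig(env = process.env) {
  const host = env.OLLAMA_HOST || "127.0.0.1";
  const port = Number.parseInt(env.OLLAMA_PORT || "11434", 10);
  const maxTokens = Number.parseInt(env.MAX_TOKENS || "1024", 10);

  // Fail early on values that cannot work, rather than at request time.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid OLLAMA_PORT: ${env.OLLAMA_PORT}`);
  }
  if (!Number.isInteger(maxTokens) || maxTokens < 1) {
    throw new RangeError(`Invalid MAX_TOKENS: ${env.MAX_TOKENS}`);
  }

  return {
    baseUrl: `http://${host}:${port}`,
    visionModel: env.VISION_MODEL || "llava:7b",
    ocrModel: env.OCR_MODEL || "llava:7b",
    maxTokens,
  };
}

// Usage: override only the vision model and keep every other default.
const exampleConfig = loadVisionConfig({ VISION_MODEL: "minicpm-v:8b-2.6" });
console.log(exampleConfig.baseUrl);     // http://127.0.0.1:11434
console.log(exampleConfig.visionModel); // minicpm-v:8b-2.6
```

Passing the environment as a parameter, rather than reading process.env directly inside the function, makes the defaults easy to test.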

Model Switching

# Use different model temporarily
VISION_MODEL=minicpm-v:8b-2.6 vlm describe --image screenshot.png

# Or specify in tool call
vlm describe --image screenshot.png --model minicpm-v:8b-2.6

🔧 Development

Project Structure

vision-mcp/
├── vision-mcp.mjs          # Main MCP server
├── vlm.mjs                 # CLI tool
├── package.json            # Dependencies
├── setup.sh                # Installation script
├── vision-mcp-wrapper.sh   # MCP wrapper
├── integration-examples.md # Integration guides
└── docs/                   # Additional documentation

Running Development Server

# Start Ollama
ollama serve &

# Test MCP server
node vision-mcp.mjs

# Test CLI tool
./vlm.mjs health

Adding New Models

  1. Pull model: ollama pull model-name
  2. Update defaults in vision-mcp.mjs and vlm.mjs
  3. Test with: vlm describe --model model-name --image test.png

🐛 Troubleshooting

Common Issues

🔴 "Ollama not responding"

ollama serve &
curl http://localhost:11434/api/tags

🔴 "Model not found"

ollama list
ollama pull llava:7b

🔴 "Tesseract not found"

# Fedora/RHEL
sudo dnf install tesseract tesseract-langpack-eng

# Ubuntu/Debian  
sudo apt install tesseract-ocr

🔴 "Permission denied"

chmod +x vision-mcp.mjs vlm.mjs vision-mcp-wrapper.sh

Debug Mode

# Enable debug output
DEBUG=1 ./vlm.mjs describe --image test.png

# Check MCP server logs
journalctl --user -f | grep vision-mcp

Performance Optimization

# Keep models warm (optional)
curl http://localhost:11434/api/generate \
  -d '{"model":"llava:7b","prompt":"warmup","keep_alive":"10m"}'

# Monitor GPU usage
nvidia-smi -l 1

📊 Benchmarks

Performance on 16GB VRAM

| Model | Memory Usage | Speed | Accuracy | Best For |
|---|---|---|---|---|
| llava:7b | ~4GB | ⚡⚡⚡ Fast | ⭐⭐⭐ Good | General use, UI analysis |
| minicpm-v:8b-2.6 | ~8GB | ⚡⚡ Medium | ⭐⭐⭐⭐ Excellent | OCR, text extraction |
| qwen2.5vl:7b | ~8GB | ⚡⚡ Medium | ⭐⭐⭐⭐ Excellent | Document analysis |

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Commit changes: git commit -am 'Add feature'
  4. Push to branch: git push origin feature-name
  5. Submit a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with ❤️ for the Claude Code ecosystem

🌟 Star this repo if you find it helpful!
🐛 Report issues on GitHub
💬 Join discussions in the community
