vision-mcp

Enables AI-powered vision analysis using local Ollama models. Supports screenshot analysis, OCR, text detection, and health monitoring via the Model Context Protocol (MCP).

Vision MCP Server 🔍

A Model Context Protocol (MCP) server that provides vision analysis using local Ollama models. It analyzes screenshots, extracts text, locates UI elements, and helps debug applications using vision language models (VLMs) that run on your own machine.

🚀 Features

🔍 Vision Analysis - Analyze screenshots and describe UI state, errors, or issues
📝 OCR Extraction - Extract text from images using VLM or Tesseract
🎯 Text Detection - Find specific text with bounding boxes for automation
🏥 Health Monitoring - Check Ollama connection and available models
🔌 Universal Integration - Works with Electron, Selenium, Playwright, and more
🏃‍♂️ CLI Tool - Standalone command-line interface for any workflow
⚡ High Performance - Optimized for 16GB+ VRAM with local models

📦 Installation

Prerequisites

Node.js 18+
Ollama (for vision models)
Claude Code (for MCP integration)
16GB+ VRAM recommended for optimal performance

Quick Setup

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/vision-mcp.git
cd vision-mcp

# 2. Install dependencies
npm install

# 3. Install Ollama and models
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull llava:7b

# 4. Run setup script
chmod +x setup.sh
./setup.sh

# 5. Test installation
./vlm.mjs health

Claude Code Integration

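Add the following to the mcpServers section of your ~/.claude.json, replacing the command path with the absolute path to vision-mcp-wrapper.sh on your machine: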

{
  "mcpServers": {
    "vision": {
      "type": "stdio",
      "command": "/home/ice/vision-mcp-wrapper.sh",
      "args": [],
      "env": {}
    }
  }
}

🎯 Quick Start

CLI Usage

# Health check (these examples assume vlm is on your PATH; otherwise run ./vlm.mjs from the repository root)
vlm health

# Analyze screenshot
vlm describe --image screenshot.png --prompt "What errors are visible?"

# Extract text (OCR)
vlm ocr --image document.png --engine tesseract

# Find UI elements
vlm find --image app.png --query "Submit button"

MCP Usage in Claude

Use vision.describe to analyze this screenshot for errors
Use vision.find_text to locate the "Login" button  
Use vision.ocr to extract all visible text
Use vision.health to check model status
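
Behind prompts like these, an MCP client sends a JSON-RPC 2.0 tools/call request to the server. The sketch below shows roughly what that message looks like for vision.describe. The tool name and argument names come from this README; the helper name buildToolCall and the placeholder image string are illustrative only.

```javascript
// Build a JSON-RPC 2.0 "tools/call" request, the MCP message a client
// sends to invoke a tool. buildToolCall is a hypothetical helper name.
function buildToolCall(id, name, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const request = buildToolCall(1, "vision.describe", {
  image_b64: "iVBORw0KGgo...", // placeholder, not a complete image
  prompt: "What errors are visible?",
});

// The stdio transport carries one JSON message per line.
console.log(JSON.stringify(request));
```

Claude Code builds this message for you; the sketch is only useful when testing the server by hand or writing a custom client.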

🛠️ Available Tools

`vision.describe`

Analyze images and describe UI state, errors, or issues.

Parameters:

image_b64 (string): Base64-encoded image
prompt (string, optional): Custom analysis prompt
model (string, optional): Ollama model to use
max_tokens (number, optional): Maximum response tokens

Example:

{
  "image_b64": "iVBORw0KGgoAAAANSUhEU...",
  "prompt": "Identify any error messages or broken UI elements",
  "model": "llava:7b"
}
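
The image_b64 value is the base64 encoding of the raw image bytes. A minimal sketch of producing the arguments object in Node.js follows; buildDescribeArgs is a hypothetical helper, and in practice the bytes would come from reading a screenshot file rather than the 8-byte PNG signature used here.

```javascript
// Turn raw image bytes into the arguments object for vision.describe.
// In real use, `bytes` would come from something like fs.readFile(path).
function buildDescribeArgs(bytes, prompt, model) {
  const args = { image_b64: Buffer.from(bytes).toString("base64") };
  if (prompt !== undefined) args.prompt = prompt;
  if (model !== undefined) args.model = model;
  return args;
}

// The 8-byte PNG signature stands in for a real screenshot.
const pngSignature = Uint8Array.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const describeArgs = buildDescribeArgs(pngSignature, "Identify any error messages");

console.log(describeArgs.image_b64); // iVBORw0KGgo=
```

Optional fields are left out when not supplied, so the server's own defaults apply.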

`vision.ocr`

Extract text from images using VLM or Tesseract.

Parameters:

image_b64 (string): Base64-encoded image
engine (string, optional): "vlm" or "tesseract"
model (string, optional): Model for VLM OCR
structured (boolean, optional): Return structured JSON

`vision.find_text`

Locate specific text and return bounding boxes.

Parameters:

image_b64 (string): Base64-encoded image
query (string): Text to search for
model (string, optional): Ollama model to use
fuzzy (boolean, optional): Allow fuzzy matching
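
For automation, a bounding box usually has to be turned into a click target. This README does not document the shape of the vision.find_text response, so the sketch below assumes a hypothetical match object with x, y, width, and height in pixels, measured from the top-left corner. Adjust the field names to whatever the tool actually returns.

```javascript
// HYPOTHETICAL response shape: the README does not specify what
// vision.find_text returns. This assumes pixel boxes such as
// { text, x, y, width, height } with the origin at the top-left corner.
function boxCenter(box) {
  return {
    x: Math.round(box.x + box.width / 2),
    y: Math.round(box.y + box.height / 2),
  };
}

const match = { text: "Login", x: 100, y: 40, width: 80, height: 24 };
const target = boxCenter(match);

console.log(target); // { x: 140, y: 52 }
```

The resulting point could then be passed to a driver call such as a mouse click in Selenium or Playwright.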

`vision.health`

Check Ollama connection and available models.
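
A comparable check can be made directly against Ollama's GET /api/tags endpoint, which lists installed models. The following is a sketch of that idea, not the tool's actual implementation; the function names and the injectable fetchImpl parameter are assumptions made so the code can be exercised without a running server.

```javascript
// List installed model names using Ollama's GET /api/tags endpoint.
// `fetchImpl` is injectable so the function can be tested offline.
async function listOllamaModels(baseUrl = "http://127.0.0.1:11434", fetchImpl = fetch) {
  const response = await fetchImpl(`${baseUrl}/api/tags`);
  if (!response.ok) {
    throw new Error(`Ollama responded with HTTP ${response.status}`);
  }
  const data = await response.json();
  return (data.models ?? []).map((m) => m.name);
}

// Report whether Ollama is reachable and a required model is installed.
async function checkHealth(requiredModel, baseUrl, fetchImpl) {
  try {
    const models = await listOllamaModels(baseUrl, fetchImpl);
    return { ok: models.includes(requiredModel), models };
  } catch (error) {
    return { ok: false, models: [], error: String(error.message ?? error) };
  }
}
```

With Ollama running locally, `checkHealth("llava:7b").then(console.log)` prints the result object. Node.js 18+ provides the global fetch used as the default.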

🔗 Integration Examples

Electron CDP Integration

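// Add to your Electron MCP server.
// Assumes `server` (an McpServer), `z` (zod), and a pickPage() helper are already in scope.
server.registerTool(
  "browser_vision_check",
  {
    title: "Analyze page with AI vision",
    // The MCP TypeScript SDK accepts a raw shape of zod fields here
    inputSchema: {
      prompt: z.string().default("Check for errors"),
      fullPage: z.boolean().default(false)
    }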
  },
  async ({ prompt, fullPage }) => {
    const page = await pickPage();
    const screenshot = await page.screenshot({ 
      type: "png", 
      encoding: "base64", 
      fullPage 
    });
    
    // Return a hint so Claude calls vision.describe next with this screenshot
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          next_tool: "vision.describe",
          args: { image_b64: screenshot, prompt }
        })
      }]
    };
  }
);

Selenium Integration

// Shell out to vlm CLI from Selenium MCP
const fs = require("fs/promises");
const { execFile } = require("child_process");

async function analyzeSeleniumPage(driver, prompt) {
  // takeScreenshot() resolves to a base64-encoded PNG
  const screenshot = await driver.takeScreenshot();
  const tmpPath = `/tmp/selenium-${Date.now()}.png`;

  await fs.writeFile(tmpPath, Buffer.from(screenshot, "base64"));

  return new Promise((resolve, reject) => {
    // Replace with the path to vlm.mjs in your clone
    execFile("/home/ice/vision-mcp/vlm.mjs",
      ["describe", "--image", tmpPath, "--prompt", prompt],
      (error, stdout) => {
        fs.unlink(tmpPath).catch(() => {}); // Best-effort cleanup
        if (error) return reject(error);
        try {
          resolve(JSON.parse(stdout));
        } catch (parseError) {
          reject(parseError); // CLI output was not valid JSON
        }
      }
    );
  });
}

Playwright in Docker

# docker-compose.yml
services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.54.2-noble
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_HOST=host.docker.internal

// Direct Ollama call from container
async function analyzeWithVision(page, prompt) {
  // Playwright's page.screenshot() returns a Buffer; encode it for Ollama
  const screenshot = (await page.screenshot({ type: "png" })).toString("base64");

  const response = await fetch("http://host.docker.internal:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava:7b",
      prompt: `Analyze this UI screenshot: ${prompt}`,
      images: [screenshot],
      stream: false
    })
  });

  if (!response.ok) {
    throw new Error(`Ollama request failed: HTTP ${response.status}`);
  }

  const data = await response.json();
  return data.response;
}

🤖 Supported Models

Recommended Models by Use Case

Use Case	Model	Size	Strengths
General UI Analysis	`llava:7b` ✅	4.1GB	Reliable, fast, good reasoning
OCR & Text Extraction	`minicpm-v:8b-2.6`	~8GB	State-of-the-art OCR accuracy
Document Analysis	`qwen2.5vl:7b`	~8GB	Excellent for complex layouts
Lightweight	`llava:7b`	4.1GB	Best speed/accuracy balance

Model Installation

# Default model
ollama pull llava:7b

# Better OCR model
ollama pull minicpm-v:8b-2.6

# Best document analysis (when available)
ollama pull qwen2.5vl:7b

⚙️ Configuration

Environment Variables

Variable	Default	Description
`OLLAMA_HOST`	`127.0.0.1`	Ollama server host
`OLLAMA_PORT`	`11434`	Ollama server port
`VISION_MODEL`	`llava:7b`	Default vision model
`OCR_MODEL`	`llava:7b`	Default OCR model
`MAX_TOKENS`	`1024`	Maximum response tokens
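
One way to read these variables with the defaults from the table is sketched below. How vision-mcp.mjs and vlm.mjs actually parse them is not shown in this README, so loadConfig and the shape of its return value are assumptions.

```javascript
// Resolve settings from environment variables, falling back to the
// documented defaults. loadConfig is a hypothetical helper, not the
// server's real code.
function loadConfig(env = process.env) {
  const host = env.OLLAMA_HOST || "127.0.0.1";
  const port = Number.parseInt(env.OLLAMA_PORT || "11434", 10);
  const maxTokens = Number.parseInt(env.MAX_TOKENS || "1024", 10);
  if (Number.isNaN(port) || Number.isNaN(maxTokens)) {
    throw new Error("OLLAMA_PORT and MAX_TOKENS must be integers");
  }
  return {
    baseUrl: `http://${host}:${port}`,
    visionModel: env.VISION_MODEL || "llava:7b",
    ocrModel: env.OCR_MODEL || "llava:7b",
    maxTokens,
  };
}

const config = loadConfig({ VISION_MODEL: "minicpm-v:8b-2.6" });
console.log(config.baseUrl);     // http://127.0.0.1:11434
console.log(config.visionModel); // minicpm-v:8b-2.6
```

Passing an explicit object instead of process.env keeps the function easy to test and makes the override order visible.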

Model Switching

# Use different model temporarily
VISION_MODEL=minicpm-v:8b-2.6 vlm describe --image screenshot.png

# Or specify in tool call
vlm describe --image screenshot.png --model minicpm-v:8b-2.6

🔧 Development

Project Structure

vision-mcp/
├── vision-mcp.mjs          # Main MCP server
├── vlm.mjs                 # CLI tool
├── package.json            # Dependencies
├── setup.sh               # Installation script
├── vision-mcp-wrapper.sh  # MCP wrapper
├── integration-examples.md # Integration guides
└── docs/                  # Additional documentation

Running Development Server

# Start Ollama
ollama serve &

# Test MCP server
node vision-mcp.mjs

# Test CLI tool
./vlm.mjs health

Adding New Models

Pull model: ollama pull model-name
Update defaults in vision-mcp.mjs and vlm.mjs
Test with: vlm describe --model model-name --image test.png

🐛 Troubleshooting

Common Issues

🔴 "Ollama not responding"

ollama serve &
curl http://localhost:11434/api/tags

🔴 "Model not found"

ollama list
ollama pull llava:7b

🔴 "Tesseract not found"

# Fedora/RHEL
sudo dnf install tesseract tesseract-langpack-eng

# Ubuntu/Debian  
sudo apt install tesseract-ocr

🔴 "Permission denied"

chmod +x vision-mcp.mjs vlm.mjs vision-mcp-wrapper.sh

Debug Mode

# Enable debug output
DEBUG=1 ./vlm.mjs describe --image test.png

# Check MCP server logs
journalctl --user -f | grep vision-mcp

Performance Optimization

# Keep models warm (optional)
curl http://localhost:11434/api/generate \
  -d '{"model":"llava:7b","prompt":"warmup","keep_alive":"10m"}'

# Monitor GPU usage
nvidia-smi -l 1
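
The same warm-up can be done from Node.js, for example at the start of a test run. This sketch relies on two documented behaviors of Ollama's POST /api/generate endpoint: a request without a prompt loads the model without generating text, and keep_alive controls how long it stays in memory. The function names and the injectable fetchImpl parameter are illustrative.

```javascript
// Build a request for Ollama's POST /api/generate that loads a model and
// keeps it in memory. Omitting the prompt loads the model without
// generating any text.
function buildWarmupRequest(model, keepAlive = "10m") {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, keep_alive: keepAlive, stream: false }),
  };
}

// Send the warm-up request; resolves to true when Ollama accepts it.
async function warmUp(model, keepAlive, baseUrl = "http://127.0.0.1:11434", fetchImpl = fetch) {
  const response = await fetchImpl(
    `${baseUrl}/api/generate`,
    buildWarmupRequest(model, keepAlive)
  );
  return response.ok;
}
```

For example, `warmUp("llava:7b", "10m")` keeps the default model loaded for ten minutes after the call.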

📊 Benchmarks

Performance on 16GB VRAM

Model	Memory Usage	Speed	Accuracy	Best For
`llava:7b`	~4GB	⚡⚡⚡ Fast	⭐⭐⭐ Good	General use, UI analysis
`minicpm-v:8b-2.6`	~8GB	⚡⚡ Medium	⭐⭐⭐⭐ Excellent	OCR, text extraction
`qwen2.5vl:7b`	~8GB	⚡⚡ Medium	⭐⭐⭐⭐ Excellent	Document analysis

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Commit changes: git commit -am 'Add feature'
Push to branch: git push origin feature-name
Submit a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Ollama - Local model runtime
LLaVA - Vision language model
MCP SDK - Protocol implementation
Claude Code - Development environment

🔗 Related Projects

Electron CDP MCP - Electron automation
Selenium MCP - Browser testing
Playwright MCP - Web automation

Made with ❤️ for the Claude Code ecosystem

🌟 Star this repo if you find it helpful!
🐛 Report issues on GitHub
💬 Join discussions in the community
