vision-mcp
Enables AI-powered vision analysis using local Ollama models. Supports screenshot analysis, OCR, text detection, and health monitoring over the Model Context Protocol (MCP).
Vision MCP Server
A powerful Model Context Protocol (MCP) server that provides AI-powered vision analysis capabilities using local Ollama models. Analyze screenshots, extract text, detect UI elements, and debug applications with state-of-the-art vision language models.
Features
- Vision Analysis - Analyze screenshots and describe UI state, errors, or issues
- OCR Extraction - Extract text from images using VLM or Tesseract
- Text Detection - Find specific text with bounding boxes for automation
- Health Monitoring - Check Ollama connection and available models
- Universal Integration - Works with Electron, Selenium, Playwright, and more
- CLI Tool - Standalone command-line interface for any workflow
- High Performance - Optimized for 16GB+ VRAM with local models
Installation
Prerequisites
- Node.js 18+
- Ollama (for vision models)
- Claude Code (for MCP integration)
- 16GB+ VRAM recommended for optimal performance
Quick Setup
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/vision-mcp.git
cd vision-mcp
# 2. Install dependencies
npm install
# 3. Install Ollama and models
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull llava:7b
# 4. Run setup script
chmod +x setup.sh
./setup.sh
# 5. Test installation
./vlm.mjs health
Claude Code Integration
Add the following to the mcpServers section of ~/.claude.json, replacing the command path with the location of vision-mcp-wrapper.sh on your machine:
{
"mcpServers": {
"vision": {
"type": "stdio",
"command": "/home/ice/vision-mcp-wrapper.sh",
"args": [],
"env": {}
}
}
}
Quick Start
CLI Usage
# Health check
vlm health
# Analyze screenshot
vlm describe --image screenshot.png --prompt "What errors are visible?"
# Extract text (OCR)
vlm ocr --image document.png --engine tesseract
# Find UI elements
vlm find --image app.png --query "Submit button"
MCP Usage in Claude
Use vision.describe to analyze this screenshot for errors
Use vision.find_text to locate the "Login" button
Use vision.ocr to extract all visible text
Use vision.health to check model status
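When Claude acts on prompts like these, the MCP client sends a JSON-RPC 2.0 `tools/call` request to the server over stdio. The sketch below shows roughly what that envelope looks like for the tool names listed in this README. `buildToolCall` is a hypothetical helper written for illustration, not something vision-mcp ships.

```javascript
// Hypothetical helper (not part of vision-mcp): wrap a tool name and its
// arguments in an MCP tools/call request.
function buildToolCall(id, name, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const healthRequest = buildToolCall(1, "vision.health", {});

// The MCP stdio transport sends one JSON message per line.
console.log(JSON.stringify(healthRequest));
```

In normal use Claude Code builds this request for you; the sketch is mainly useful when testing the server by hand.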
Available Tools
vision.describe
Analyze images and describe UI state, errors, or issues.
Parameters:
- image_b64 (string): Base64-encoded image
- prompt (string, optional): Custom analysis prompt
- model (string, optional): Ollama model to use
- max_tokens (number, optional): Maximum response tokens
Example:
{
"image_b64": "iVBORw0KGgoAAAANSUhEU...",
"prompt": "Identify any error messages or broken UI elements",
"model": "llava:7b"
}
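The `image_b64` value is the raw image bytes encoded as base64. A minimal sketch of building these arguments in Node.js follows. `buildDescribeArgs` is a hypothetical helper, and the eight-byte PNG signature stands in for a real screenshot.

```javascript
// Hypothetical helper (not part of vision-mcp): build the arguments object
// for a vision.describe call from raw image bytes.
function buildDescribeArgs(imageBytes, options = {}) {
  const args = { image_b64: Buffer.from(imageBytes).toString("base64") };
  // Copy only the optional fields the tool documents, and only when set.
  for (const key of ["prompt", "model", "max_tokens"]) {
    if (options[key] !== undefined) args[key] = options[key];
  }
  return args;
}

// The first eight bytes of any PNG file (the PNG signature).
const pngSignature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

const describeArgs = buildDescribeArgs(pngSignature, {
  prompt: "Identify any error messages or broken UI elements",
  model: "llava:7b",
});

console.log(describeArgs.image_b64); // iVBORw0KGgo=
```

With a real file, pass the Buffer returned by `fs.readFileSync("screenshot.png")` as `imageBytes`.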
vision.ocr
Extract text from images using VLM or Tesseract.
Parameters:
- image_b64 (string): Base64-encoded image
- engine (string, optional): "vlm" or "tesseract"
- model (string, optional): Model for VLM OCR
- structured (boolean, optional): Return structured JSON
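Because `engine` accepts only two values, it can help to validate it before sending the call. The following is a sketch under that assumption. `buildOcrArgs` is a hypothetical helper, and how the server itself reacts to an unknown engine is not documented here.

```javascript
// Hypothetical helper (not part of vision-mcp): build vision.ocr arguments
// and reject engine values other than the two the tool documents.
const OCR_ENGINES = ["vlm", "tesseract"];

function buildOcrArgs(imageB64, options = {}) {
  const { engine, model, structured } = options;
  if (engine !== undefined && !OCR_ENGINES.includes(engine)) {
    throw new RangeError(`engine must be one of: ${OCR_ENGINES.join(", ")}`);
  }
  const args = { image_b64: imageB64 };
  if (engine !== undefined) args.engine = engine;
  if (model !== undefined) args.model = model;
  if (structured !== undefined) args.structured = structured;
  return args;
}

// "iVBORw0KGgo=" is the base64 form of the 8-byte PNG signature,
// used here as a stand-in for a real image.
const ocrArgs = buildOcrArgs("iVBORw0KGgo=", { engine: "tesseract", structured: true });
console.log(JSON.stringify(ocrArgs));
```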
vision.find_text
Locate specific text and return bounding boxes.
Parameters:
- image_b64 (string): Base64-encoded image
- query (string): Text to search for
- model (string, optional): Ollama model to use
- fuzzy (boolean, optional): Allow fuzzy matching
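For automation, a bounding box usually has to be reduced to a single click point. The sketch below assumes each match has the shape `{ text, x, y, width, height }` in pixels. That shape is an assumption made for illustration, since this README does not document the actual response format, so adjust the field names to whatever the tool returns.

```javascript
// ASSUMPTION: a match looks like { text, x, y, width, height } in pixels,
// with (x, y) as the top-left corner. The real response shape may differ.
function boxCenter(box) {
  return {
    x: Math.round(box.x + box.width / 2),
    y: Math.round(box.y + box.height / 2),
  };
}

// Made-up sample match, not real tool output.
const loginButton = { text: "Login", x: 100, y: 40, width: 80, height: 30 };
const clickPoint = boxCenter(loginButton);

console.log(clickPoint); // { x: 140, y: 55 }
```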
vision.health
Check Ollama connection and available models.
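A comparable check can be made directly against Ollama, whose `GET /api/tags` endpoint lists installed models. The sketch below inspects such a response body. `hasModel` is a hypothetical helper, and whether vision.health performs exactly this check internally is an assumption.

```javascript
// Hypothetical helper (not part of vision-mcp): given the JSON body returned
// by Ollama's GET /api/tags, report whether a model name is installed.
function hasModel(tagsBody, modelName) {
  const models = Array.isArray(tagsBody?.models) ? tagsBody.models : [];
  return models.some((m) => m.name === modelName);
}

// Sample body trimmed to the one field used here.
const sampleTags = { models: [{ name: "llava:7b" }] };

console.log(hasModel(sampleTags, "llava:7b")); // true
console.log(hasModel(sampleTags, "minicpm-v:8b-2.6")); // false
```

On Node.js 18+, the body can be fetched with `await (await fetch("http://127.0.0.1:11434/api/tags")).json()`.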
Integration Examples
Electron CDP Integration
// Add to your Electron MCP server (assumes `server`, zod's `z`, and a pickPage() helper are already in scope)
server.registerTool(
"browser_vision_check",
{
title: "Analyze page with AI vision",
inputSchema: z.object({
prompt: z.string().default("Check for errors"),
fullPage: z.boolean().default(false)
})
},
async ({ prompt, fullPage }) => {
const page = await pickPage();
const screenshot = await page.screenshot({
type: "png",
encoding: "base64",
fullPage
});
// Return a hint asking Claude to call vision.describe next with this screenshot
return {
content: [{
type: "text",
text: JSON.stringify({
next_tool: "vision.describe",
args: { image_b64: screenshot, prompt }
})
}]
};
}
);
Selenium Integration
// Shell out to the vlm CLI from a Selenium MCP server
const fs = require("fs/promises");
const { execFile } = require("child_process");

async function analyzeSeleniumPage(driver, prompt) {
  // takeScreenshot() resolves to a base64-encoded PNG
  const screenshot = await driver.takeScreenshot();
  const tmpPath = `/tmp/selenium-${Date.now()}.png`;
  await fs.writeFile(tmpPath, Buffer.from(screenshot, "base64"));
  return new Promise((resolve, reject) => {
    // Adjust this path to your own vision-mcp checkout
    execFile("/home/ice/vision-mcp/vlm.mjs",
      ["describe", "--image", tmpPath, "--prompt", prompt],
      (error, stdout) => {
        // Best-effort cleanup; ignore failures removing the temp file
        fs.unlink(tmpPath).catch(() => {});
        if (error) return reject(error);
        try {
          resolve(JSON.parse(stdout));
        } catch (parseError) {
          reject(parseError);
        }
      }
    );
  });
}
Playwright in Docker
# docker-compose.yml
services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.54.2-noble
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_HOST=host.docker.internal
// Direct Ollama call from container
async function analyzeWithVision(page, prompt) {
// Playwright's page.screenshot() returns a Buffer, so encode it explicitly
const screenshot = (await page.screenshot()).toString("base64");
const response = await fetch("http://host.docker.internal:11434/api/generate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llava:7b",
prompt: `Analyze this UI screenshot: ${prompt}`,
images: [screenshot],
stream: false
})
});
const data = await response.json();
return data.response;
}
Supported Models
Recommended Models by Use Case
| Use Case | Model | Size | Strengths |
|---|---|---|---|
| General UI Analysis | llava:7b | 4.1GB | Reliable, fast, good reasoning |
| OCR & Text Extraction | minicpm-v:8b-2.6 | ~8GB | State-of-the-art OCR accuracy |
| Document Analysis | qwen2.5vl:7b | ~8GB | Excellent for complex layouts |
| Lightweight | llava:7b | 4.1GB | Best speed/accuracy balance |
Model Installation
# Current default (installed)
ollama pull llava:7b
# Better OCR model
ollama pull minicpm-v:8b-2.6
# Best document analysis (when available)
ollama pull qwen2.5vl:7b
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1 | Ollama server host |
| OLLAMA_PORT | 11434 | Ollama server port |
| VISION_MODEL | llava:7b | Default vision model |
| OCR_MODEL | llava:7b | Default OCR model |
| MAX_TOKENS | 1024 | Maximum response tokens |
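As an illustration of how these variables might be consumed, the sketch below resolves them against the documented defaults. `loadConfig` and the field names it returns are hypothetical, and the actual parsing in vision-mcp.mjs and vlm.mjs may differ.

```javascript
// Hypothetical helper (not part of vision-mcp): resolve settings from an
// environment object, falling back to the defaults documented above.
function loadConfig(env = process.env) {
  const host = env.OLLAMA_HOST || "127.0.0.1";
  const port = Number(env.OLLAMA_PORT || 11434);
  return {
    baseUrl: `http://${host}:${port}`,
    visionModel: env.VISION_MODEL || "llava:7b",
    ocrModel: env.OCR_MODEL || "llava:7b",
    maxTokens: Number(env.MAX_TOKENS || 1024),
  };
}

// Override only the host, as in the Docker example; everything else defaults.
const dockerConfig = loadConfig({ OLLAMA_HOST: "host.docker.internal" });

console.log(dockerConfig.baseUrl); // http://host.docker.internal:11434
```

Passing the environment in as a parameter keeps the function easy to test without touching `process.env`.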
Model Switching
# Use different model temporarily
VISION_MODEL=minicpm-v:8b-2.6 vlm describe --image screenshot.png
# Or specify in tool call
vlm describe --image screenshot.png --model minicpm-v:8b-2.6
Development
Project Structure
vision-mcp/
├── vision-mcp.mjs            # Main MCP server
├── vlm.mjs                   # CLI tool
├── package.json              # Dependencies
├── setup.sh                  # Installation script
├── vision-mcp-wrapper.sh     # MCP wrapper
├── integration-examples.md   # Integration guides
└── docs/                     # Additional documentation
Running Development Server
# Start Ollama
ollama serve &
# Test MCP server
node vision-mcp.mjs
# Test CLI tool
./vlm.mjs health
Adding New Models
- Pull model: ollama pull model-name
- Update defaults in vision-mcp.mjs and vlm.mjs
- Test with: vlm describe --model model-name --image test.png
Troubleshooting
Common Issues
"Ollama not responding"
ollama serve &
curl http://localhost:11434/api/tags
"Model not found"
ollama list
ollama pull llava:7b
"Tesseract not found"
# Fedora/RHEL
sudo dnf install tesseract tesseract-langpack-eng
# Ubuntu/Debian
sudo apt install tesseract-ocr
"Permission denied"
chmod +x vision-mcp.mjs vlm.mjs vision-mcp-wrapper.sh
Debug Mode
# Enable debug output
DEBUG=1 ./vlm.mjs describe --image test.png
# Check MCP server logs
journalctl --user -f | grep vision-mcp
Performance Optimization
# Keep models warm (optional)
curl http://localhost:11434/api/generate \
-d '{"model":"llava:7b","prompt":"warmup","keep_alive":"10m"}'
# Monitor GPU usage
nvidia-smi -l 1
Benchmarks
Performance on 16GB VRAM
| Model | Memory Usage | Speed | Accuracy | Best For |
|---|---|---|---|---|
| llava:7b | ~4GB | Fast | Good | General use, UI analysis |
| minicpm-v:8b-2.6 | ~8GB | Medium | Excellent | OCR, text extraction |
| qwen2.5vl:7b | ~8GB | Medium | Excellent | Document analysis |
Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Commit changes: git commit -am 'Add feature'
- Push to branch: git push origin feature-name
- Submit a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Ollama - Local model runtime
- LLaVA - Vision language model
- MCP SDK - Protocol implementation
- Claude Code - Development environment
Related Projects
- Electron CDP MCP - Electron automation
- Selenium MCP - Browser testing
- Playwright MCP - Web automation
Made with ❤️ for the Claude Code ecosystem
Star this repo if you find it helpful! Report issues on GitHub. Join discussions in the community.