vision-mcp

vision-mcp

Gives text-only coding agents the ability to 'see' images, videos, and screenshots by routing them to a vision model and returning structured text.

Category
Visit Server

README

vision-mcp

Give eyes to any "blind" coding agent — a local MCP server that lets text-only models see screenshots, error images, design mockups and video.

给所有看不见图的编码 agent 装上眼睛 —— 一个本地 MCP 服务,让纯文本模型也能看懂截图、报错图、设计稿和视频。

English · 中文

It hands images to a vision model "out of band" and returns text + machine-readable metadata to the host, so a host that can't see images (GLM coding model, DeepSeek, Qwen-coder, local models…) effectively gains sight. Works with any OpenAI-compatible vision backend (GLM / Kimi / Xiaomi MiMo / OpenAI / local vLLM), and adds server-led agentic zoom to read details that downsampling would otherwise lose.


English

Features

  • 8 task-specific tools: UI→code, OCR, error diagnosis, diagram understanding, chart reading, UI diff, generic image, video.
  • Any OpenAI-compatible backend via one provider + per-backend profile. Switch with 3 env vars. GLM by default.
  • Agentic auto-zoom: server crops & upscales coarse→fine (grid → grounding → precise crop) from the full-resolution original, so tiny text becomes legible. Verified to fix hallucination on small detail (see Verified).
  • Universal video: backends without native video automatically get ffmpeg frame-sampling → multi-image.
  • Dual output: content (structured markdown) + structuredContent (confidence / regions / rounds / warnings / provider / model).
  • Safe by default: local-path allowlist, URL download + SSRF check (no blind passthrough), magic-byte validation, size caps.

How it works (30-second mental model)

host tool(image, detail_level)
  → validate + load media (full-res original + downsampled overview; path allowlist; URL SSRF)
  → overview ? single pass : zoomLoop (deterministic grid → model votes/grounding → crop original → early-exit)
  → content(markdown) + structuredContent(metadata)

The vision model is a "consultant" hired for one question; only its text answer comes back to the host. Crops always come from the full-resolution original (never the downsampled overview), so zoom actually recovers detail.

Quick start

1. Install & build (Node ≥ 20)

npm install
npm run build

2. Pick a backend. Default is GLM (z.ai). Any OpenAI-compatible /chat/completions endpoint with image_url support works — set 3 env vars:

export VISION_API_KEY=your_key
export VISION_BASE_URL=https://api.z.ai/api/paas/v4
export VISION_MODEL=glm-4.6v

3. Connect your MCP client (zcode / Cline / Cursor / Claude Desktop …). Add to its mcpServers config:

{
  "mcpServers": {
    "vision": {
      "command": "node",
      "args": ["<ABSOLUTE_PATH>/dist/index.js"],
      "env": {
        "VISION_API_KEY": "your_key",
        "VISION_BASE_URL": "https://api.z.ai/api/paas/v4",
        "VISION_MODEL": "glm-4.6v",
        "VISION_ALLOWED_DIRS": "<ABSOLUTE_PATH_TO_YOUR_PROJECT>"
      }
    }
  }
}

Timeout: reasoning vision models are slow (code/spec generation can take 30–60s+). Set your host's MCP tool timeout generously (≥120s). During deep zoom the server emits notifications/progress, so clients that honor resetTimeoutOnProgress stay alive.

Usage tutorial

The host is a text-only agent — it never sees the image. You tell the agent where the image is (a path under VISION_ALLOWED_DIRS, or a URL / data URI) and which tool to use; the agent calls the tool, the server returns text, the agent continues.

Example 1 — diagnose an error screenshot

  1. Save your screenshot at e.g. ./screenshots/error.png (inside an allowed dir).
  2. Ask your agent: "Use the vision MCP diagnose_error_screenshot on ./screenshots/error.png."
  3. The tool returns markdown sections ## Root cause / ## Verbatim error / ## Location / ## Fix steps, and the agent uses it to write the fix.

Example 2 — read a tiny detail (agentic zoom)

For small text the model can't read at a glance, pass detail_level: "fine":

// the call your agent makes
{ "name": "extract_text_from_screenshot",
  "arguments": { "image": "./screenshots/big.png", "detail_level": "fine",
                 "question": "read the small key code in the bottom-right corner" } }

The server runs the zoom loop: it grids the image, the model votes which region matters, the server crops that region from the original and re-reads it — recovering text that a single overview pass would misread. structuredContent.regions shows where it looked, rounds how many passes it took.

detail_level values: overview (single fast pass) · normal · fine (deep zoom) · auto (default — zooms only when needed, early-exits when clear).

Example 3 — analyze a video

{ "name": "video_analysis",
  "arguments": { "video": "./clips/repro.mp4", "question": "what bug is shown?" } }

If the backend has no native video, the server samples frames with ffmpeg and analyzes them as images — so video works on any vision backend.

Quick local test (no client needed) — the MCP Inspector:

npx @modelcontextprotocol/inspector node dist/index.js
# then set the env vars in the Inspector UI and call any tool

Or run the bundled scripts: node scripts/smoke.mjs (lists tools), KEY=... node scripts/livetest.mjs (drives all 8 tools end-to-end).

Tools

Tool Purpose
ui_to_artifact UI screenshot → code / spec
extract_text_from_screenshot verbatim OCR
diagnose_error_screenshot error diagnosis (root cause / verbatim / location / fix)
understand_technical_diagram architecture / flow / UML / ER / sequence diagrams
analyze_data_visualization read charts / dashboards
ui_diff_check compare two UI screenshots
image_analysis generic image understanding (fallback)
video_analysis video understanding (native or frame-sampled)

Common params: detail_level, question, region, thinking.

Backends

Backend VISION_PROFILE VISION_BASE_URL VISION_MODEL
GLM (z.ai) glm https://api.z.ai/api/paas/v4 glm-4.6v
Kimi (Moonshot) kimi https://api.moonshot.ai/v1 <kimi vision model>
Xiaomi MiMo mimo <your endpoint>/v1 mimo-v2.5 (vision build)
OpenAI openai https://api.openai.com/v1 gpt-4o
local vLLM/Ollama generic http://localhost:8000/v1 <local VLM>

Any OpenAI-compatible endpoint with image_url support works with generic.

Configuration (env)

Var Default Notes
VISION_API_KEY (alias Z_AI_API_KEY) — required backend key
VISION_PROFILE glm glm / kimi / mimo / openai / generic
VISION_BASE_URL per profile OpenAI-compatible endpoint
VISION_MODEL per profile vision model id
VISION_ALLOWED_DIRS cwd dirs allowed for local image paths (;/: separated)
VISION_ALLOW_URL_PASSTHROUGH false forward image URLs to the backend (default: download + SSRF-check → data URI)
VISION_MAX_ZOOM_ROUNDS 3 max agentic zoom rounds
VISION_MAX_EDGE_PX 1568 overview downsample longest edge
VISION_VIDEO_FRAMES 8 frames sampled per video
VISION_MAX_IMAGE_MB / _VIDEO_MB 10 / 50 size caps

Verified live

Tested against Xiaomi MiMo (mimo-v2.5 vision / mimo-v2.5-pro blind):

  • All 8 tools work end-to-end.
  • Agentic zoom proven: a tiny key code in a corner — overview hallucinated it (REV: 2A-74A1-0); detail_level=fine navigated to the bottom-right and read the correct KEY: ZX-7741-Q consistently across re-runs.
  • Video frame-sampling: mimo has no native video → auto frame-sampling succeeded.
  • Blind model degrades gracefully: mimo-v2.5-pro returns 404 No endpoints found that support image input → tool returns isError + a clear message, no crash.

Reproduce: KEY=... PROFILE=mimo MODEL=mimo-v2.5 BASE=<endpoint>/v1 node scripts/livetest.mjs.

Privacy

Your images/video are sent to the configured backend API. Don't use untrusted backends for sensitive content; a local deployment (generic profile → local vLLM) keeps data on your machine.

Development

npm run dev    # run source via tsx
npm run build  # tsc → dist/
npm test       # vitest, fully offline (no API key) — 33 tests

Tests cover the zoom state machine (early-exit / budget / parse-fail / out-of-bounds / grounding / tool-calling), media security (magic-bytes / path traversal / SSRF / downscale), tool schemas, and video frame sampling.

Architecture

src/: provider/ (OpenAI-compatible client + profiles), core/zoomLoop.ts, media/ (load / transform / security / video), tools/, prompts.ts.


中文

让纯文本模型也能「看见」:把图交给视觉模型「带外」分析,只把文字结论 + 机器可读元数据回传宿主。支持任何 OpenAI 兼容的视觉后端(GLM / Kimi / 小米 MiMo / OpenAI / 本地 vLLM),并用 server 主导的 agentic 缩放读出降采样会丢失的细节。

特性

  • 8 个任务专用工具:UI 转代码、OCR、报错诊断、技术图理解、图表读数、UI 对比、通用图像、视频理解。
  • 多后端:一个 provider + per-backend profile,改 3 个 env 即切换,GLM 默认。
  • Agentic 自动缩放:从全分辨率原图由粗到细裁切放大(九宫格 → grounding → 精确裁切),让小字可读;已实测能修正小细节上的幻觉(见已联机验证)。
  • 视频通用:无原生视频能力的后端自动走 ffmpeg 帧采样 → 多图分析。
  • 双输出content(结构化 markdown)+ structuredContent(confidence / regions / rounds / warnings / provider / model)。
  • 默认安全:本地路径白名单、URL 默认下载 + SSRF 校验(不透传)、magic-bytes 校验、大小上限。

工作原理(30 秒心智模型)

宿主 tool(image, detail_level)
  → 校验 + 载媒体(全分辨率原图 + 降采样概览图;路径白名单;URL SSRF)
  → overview ? 单次 : zoomLoop(确定性网格 → 模型投票/grounding → 裁原图 → 早退)
  → content(markdown) + structuredContent(元数据)

视觉模型是为「一个问题」临时请来的顾问,只有它的文字答案回到宿主。裁切始终从全分辨率原图取(不是降采样图),所以缩放才能真正找回细节。

快速上手

1. 安装构建(Node ≥ 20)

npm install
npm run build

2. 选后端。默认 GLM(z.ai)。任何支持 image_url 的 OpenAI 兼容端点都行,设 3 个 env:

export VISION_API_KEY=你的key
export VISION_BASE_URL=https://api.z.ai/api/paas/v4
export VISION_MODEL=glm-4.6v

3. 接入 MCP 客户端(zcode / Cline / Cursor / Claude Desktop……),在其 mcpServers 配置里加:

{
  "mcpServers": {
    "vision": {
      "command": "node",
      "args": ["<绝对路径>/dist/index.js"],
      "env": {
        "VISION_API_KEY": "你的key",
        "VISION_BASE_URL": "https://api.z.ai/api/paas/v4",
        "VISION_MODEL": "glm-4.6v",
        "VISION_ALLOWED_DIRS": "<你的项目绝对路径>"
      }
    }
  }
}

超时:推理型视觉模型较慢(生成代码/规格可能 30–60s+)。把宿主的 MCP 工具超时设宽松些(≥120s)。深度缩放期间 server 会发 notifications/progress,支持 resetTimeoutOnProgress 的客户端可借此保活。

使用教程

宿主是纯文本 agent,永远看不到图。你告诉它图在哪(VISION_ALLOWED_DIRS 下的路径,或 URL / data URI)、用哪个工具;agent 调用工具,server 回文字,agent 继续干活。

例 1 — 诊断报错截图

  1. 把截图存到 ./screenshots/error.png(在允许目录内)。
  2. 对 agent 说:「用 vision MCP 的 diagnose_error_screenshot 分析 ./screenshots/error.png
  3. 工具返回 ## 根因 / ## 错误原文 / ## 位置 / ## 修复步骤,agent 据此写修复。

例 2 — 读小细节(agentic 缩放)

模型一眼读不清的小字,传 detail_level: "fine"

{ "name": "extract_text_from_screenshot",
  "arguments": { "image": "./screenshots/big.png", "detail_level": "fine",
                 "question": "读出右下角的小密钥码" } }

server 跑缩放循环:把图划网格 → 模型投票哪块相关 → 从原图裁该块重读,找回单次概览会读错的文字。structuredContent.regions 显示它看了哪、rounds 显示用了几轮。

detail_level 取值:overview(单次快速)·normal·fine(深度缩放)·auto(默认,需要才缩放、清晰则早退)。

例 3 — 分析视频

{ "name": "video_analysis",
  "arguments": { "video": "./clips/repro.mp4", "question": "视频里是什么 bug?" } }

后端无原生视频时,server 用 ffmpeg 抽帧当多图分析 —— 任意视觉后端都能处理视频。

本地快测(无需客户端) —— MCP Inspector:

npx @modelcontextprotocol/inspector node dist/index.js
# 在 Inspector 界面里填好 env,点任意工具

或用自带脚本:node scripts/smoke.mjs(列工具)、KEY=... node scripts/livetest.mjs(全 8 工具端到端)。

工具

工具 用途
ui_to_artifact UI 截图 → 代码 / 规格
extract_text_from_screenshot 逐字 OCR
diagnose_error_screenshot 报错诊断(根因/原文/位置/修复)
understand_technical_diagram 架构/流程/UML/ER/时序图
analyze_data_visualization 图表/仪表盘读数
ui_diff_check 两张 UI 截图对比
image_analysis 通用图像理解(兜底)
video_analysis 视频理解(原生或帧采样)

公共参数:detail_levelquestionregionthinking

后端

后端 VISION_PROFILE VISION_BASE_URL VISION_MODEL
GLM (z.ai) glm https://api.z.ai/api/paas/v4 glm-4.6v
Kimi (Moonshot) kimi https://api.moonshot.ai/v1 <kimi 视觉模型>
小米 MiMo mimo <你的端点>/v1 mimo-v2.5(视觉版)
OpenAI openai https://api.openai.com/v1 gpt-4o
本地 vLLM/Ollama generic http://localhost:8000/v1 <本地 VLM>

任何 OpenAI 兼容、支持 image_url 的端点都能用 generic 直接接入。

配置(env)

变量 默认 说明
VISION_API_KEY(别名 Z_AI_API_KEY — 必填 后端 key
VISION_PROFILE glm glm / kimi / mimo / openai / generic
VISION_BASE_URL 随 profile OpenAI 兼容端点
VISION_MODEL 随 profile 视觉模型
VISION_ALLOWED_DIRS 当前目录 允许读取本地图片的目录(;/: 分隔)
VISION_ALLOW_URL_PASSTHROUGH false 是否把图片 URL 直接透传给后端(默认下载+SSRF 校验后转 data URI)
VISION_MAX_ZOOM_ROUNDS 3 agentic 缩放最大轮数
VISION_MAX_EDGE_PX 1568 概览图降采样最大边长
VISION_VIDEO_FRAMES 8 每个视频抽帧数
VISION_MAX_IMAGE_MB / _VIDEO_MB 10 / 50 大小上限

已联机验证

针对 小米 MiMomimo-v2.5 视觉版 / mimo-v2.5-pro 盲版)实测:

  • 8 个工具全部端到端跑通。
  • Agentic 缩放验证有效:角落小密钥码,overview 单次会编造REV: 2A-74A1-0);detail_level=fine 导航到右下、复跑稳定读出正确的 KEY: ZX-7741-Q
  • 视频帧采样:mimo 无原生视频 → 自动抽帧成功。
  • 盲模型优雅降级mimo-v2.5-pro 返回 404 No endpoints found that support image input → 工具以 isError + 清晰提示返回,不崩溃。

复现:KEY=... PROFILE=mimo MODEL=mimo-v2.5 BASE=<端点>/v1 node scripts/livetest.mjs

隐私

调用时图片/视频会发送到所配置的视觉后端 API。敏感内容请勿用不可信后端;本地部署(generic → 本地 vLLM)可避免数据外发。

开发

npm run dev    # tsx 直接跑源码
npm run build  # tsc → dist/
npm test       # vitest 离线测试(无需 key)—— 33 个用例

测试覆盖:缩放状态机(早退/预算/解析失败/越界/grounding/tool-calling)、媒体安全(magic-bytes/路径越界/SSRF/降采样)、工具 schema、视频帧采样。

架构

src/provider/(OpenAI 兼容 + profile)、core/zoomLoop.tsmedia/(load/transform/security/video)、tools/prompts.ts

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured