vlm-mcp-server
An MCP server providing vision and video analysis tools, configurable with any model provider.
README
VLM MCP Server
中文文档 | English

A Model Context Protocol (MCP) server providing vision & video analysis tools, configurable with any model provider.
This is a reverse-engineered and extended reimplementation of @z_ai/mcp-server (Apache-2.0, credit to Chao Gong, Lei Yuan / Z.AI). It introduces a provider abstraction layer so the same set of tools can run against any of three API families:
- Chat Completions — OpenAI-compatible
POST {base}/chat/completions(OpenAI, Z.AI, Zhipu, OpenRouter, Together, Groq, DeepSeek, Moonshot, local Ollama / LM Studio, …) - Responses — OpenAI
POST {base}/responses(gpt-4o, o-series reasoning models) - Anthropic Messages —
POST {base}/v1/messages(Claude, and Anthropic-compatible gateways)
Quick Start
npx -y @syntx-ai/vlm-mcp-server
That's it for the server side — it speaks MCP over stdio. You need to configure it in your MCP client. Pick your provider and set three environment variables:
| Provider | Environment variables |
|---|---|
| Chat Completions | OPENAI_CHAT_COMPLETIONS_API_KEY · OPENAI_CHAT_COMPLETIONS_BASE_URL · OPENAI_CHAT_COMPLETIONS_MODEL |
| Responses | OPENAI_RESPONSES_API_KEY · OPENAI_RESPONSES_BASE_URL · OPENAI_RESPONSES_MODEL |
| Anthropic | OPENAI_ANTHROPIC_API_KEY · OPENAI_ANTHROPIC_BASE_URL · OPENAI_ANTHROPIC_MODEL |
Claude Code one-liner (Chat Completions example — replace with your values):
claude mcp add -s user vlm-mcp-server \
--env OPENAI_CHAT_COMPLETIONS_API_KEY=sk-... \
OPENAI_CHAT_COMPLETIONS_BASE_URL=https://api.openai.com/v1/ \
OPENAI_CHAT_COMPLETIONS_MODEL=gpt-4o \
-- npx -y @syntx-ai/vlm-mcp-server
For other clients (Cline, OpenCode, Crush, Roo Code, …), see Client Configuration.
Available Tools
Image Analysis
| Tool | Description |
|---|---|
ui_to_artifact |
Convert UI screenshots to code, prompts, specs, or descriptions |
extract_text_from_screenshot |
OCR — extract code, terminal output, or text from screenshots |
diagnose_error_screenshot |
Analyze error messages and stack traces, suggest fixes |
understand_technical_diagram |
Analyze architecture, flowchart, UML, ER, and sequence diagrams |
analyze_data_visualization |
Extract insights, trends, and anomalies from charts |
ui_diff_check |
Visual regression — compare expected vs actual UI, prioritize issues |
analyze_image |
General-purpose image analysis (fallback) |
Video Analysis
| Tool | Description |
|---|---|
analyze_video |
Video content analysis (local files or URLs, ≤8MB, MP4/MOV/M4V) |
Configuration
The server loads variables from a .env file at startup (real environment variables take precedence). Three layers are supported; precedence is per-provider groups > generic > legacy.
Per-provider groups
Configure each API family independently. auto picks the first group with both a key and a base URL set.
| Variable group | API family |
|---|---|
OPENAI_CHAT_COMPLETIONS_API_KEY / _BASE_URL / _MODEL |
Chat Completions |
OPENAI_RESPONSES_API_KEY / _BASE_URL / _MODEL |
Responses |
OPENAI_ANTHROPIC_API_KEY / _BASE_URL / _MODEL |
Anthropic Messages |
Generic variables
| Variable | Description | Default |
|---|---|---|
VLM_API_KEY |
API key | (required) |
VLM_BASE_URL |
Provider API root | Zhipu default |
VLM_VISION_MODEL |
Model name | glm-4.6v |
VLM_PROVIDER |
Provider family: auto / chat-completions / responses / anthropic |
auto |
VLM_VISION_MODEL_TEMPERATURE |
Sampling temperature | 0.8 |
VLM_VISION_MODEL_TOP_P |
Top-p | 0.6 |
VLM_VISION_MODEL_MAX_TOKENS |
Max output tokens | 32768 |
VLM_TIMEOUT |
Request timeout (ms) | 300000 |
VLM_RETRY_COUNT |
Retry attempts | 1 |
VLM_ENABLE_THINKING |
Enable provider-specific reasoning / thinking request fields. Off by default for broad OpenAI-compatible Chat Completions support. | false |
VLM_ANTHROPIC_VERSION |
anthropic-version header (Anthropic only) |
2023-06-01 |
VLM_LOG_PATH |
Custom log file path | ~/.vlm/vlm-mcp-YYYY-MM-DD.log |
Provider auto-detection
In auto mode (when no OPENAI_* group is set), the provider is inferred as follows:
- Base URL contains
anthropic, or key starts withsk-ant→anthropic - Otherwise →
chat-completions(the most broadly compatible default)
Usage Examples
Once the server is installed in your client, you can use it through conversation. For example, in Claude Code, type describe this demo.png — the MCP Server will process the image and return a description (the image must exist in the current directory).
Outside Claude Code, pasting an image directly into the client will NOT invoke this MCP Server — the client encodes the image and calls the model API itself. Best practice: place images in a local directory and refer to them by name or path in conversation, e.g.
What does demo.png describe?
Troubleshooting
Run the server directly from the command line to verify it starts, isolating environment / permission issues:
# Linux / macOS
OPENAI_CHAT_COMPLETIONS_API_KEY=sk-... \
OPENAI_CHAT_COMPLETIONS_BASE_URL=https://api.openai.com/v1/ \
OPENAI_CHAT_COMPLETIONS_MODEL=gpt-4o \
npx -y @syntx-ai/vlm-mcp-server
# Windows CMD
set OPENAI_CHAT_COMPLETIONS_API_KEY=sk-... && set OPENAI_CHAT_COMPLETIONS_BASE_URL=https://api.openai.com/v1/ && set OPENAI_CHAT_COMPLETIONS_MODEL=gpt-4o && npx -y @syntx-ai/vlm-mcp-server
# Windows PowerShell
$env:OPENAI_CHAT_COMPLETIONS_API_KEY="sk-..."; $env:OPENAI_CHAT_COMPLETIONS_BASE_URL="https://api.openai.com/v1/"; $env:OPENAI_CHAT_COMPLETIONS_MODEL="gpt-4o"; npx -y @syntx-ai/vlm-mcp-server
- If it starts successfully, the environment is correct — the issue is likely in the client's MCP config; double-check it.
- If it fails, investigate the error message (pasting it to an LLM for analysis is recommended).
Common issues
Connection failure
- Ensure Node.js 18 or newer is installed.
- Run
node -vandnpx -vto confirm the runtime is available. - Verify the environment variables (
OPENAI_*triple orVLM_*) are set correctly.
Invalid API Key
- Confirm the API Key was copied correctly.
- Check that the API Key is activated.
- Ensure the selected provider family matches the API Key (Chat Completions / Responses / Anthropic).
- Check that the API Key has sufficient balance.
Connection timeout
- Check your network connection.
- Check firewall settings.
- Try switching to a different provider family or base URL.
- Increase the timeout (
VLM_TIMEOUT, default 300000ms).
Architecture
src/
├── index.ts # Entry point: starts the MCP server, registers all tools
├── types/ # Error types (McpError, ApiError, ValidationError, …)
├── core/
│ ├── environment.ts # Env config (VLM_* + OPENAI_* groups), URL resolution
│ ├── chat-service.ts # Delegates to the active VisionProvider
│ ├── file-service.ts # File validation + base64 encoding (image/video)
│ ├── base-image-service.ts # Shared image-processing logic for all image tools
│ ├── api-common.ts # Message builders, response helpers, retry wrapper
│ ├── error-handler.ts # Error hierarchy + handling/recovery strategies
│ └── logger.ts # stderr + file logger (keeps stdout JSON-clean)
├── providers/ # Pluggable model-provider abstraction
│ ├── types.ts # VisionProvider interface, ChatMessage, postJson helper
│ ├── chat-completions.ts # OpenAI-compatible Chat Completions
│ ├── responses.ts # OpenAI Responses API
│ ├── anthropic.ts # Anthropic Messages API
│ └── index.ts # Provider selection (VLM_PROVIDER / auto-infer)
├── prompts/ # System prompts for each specialized tool
└── tools/ # 8 tool registrations (7 image + 1 video)
The provider layer (src/providers/) is the key extension. Each provider implements a VisionProvider interface that takes normalized ChatMessage[] (the OpenAI Chat Completions content-part format as internal lingua franca) and translates it to the provider's wire format. chat-service.ts simply delegates to the resolved provider, so none of the tool code needed to change.
License
Apache-2.0
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.