vlm-mcp-server

An MCP server providing vision and video analysis tools, configurable with any model provider.

VLM MCP Server

Chinese documentation | English

A Model Context Protocol (MCP) server providing vision & video analysis tools, configurable with any model provider.

This is a reverse-engineered and extended reimplementation of @z_ai/mcp-server (Apache-2.0, credit to Chao Gong, Lei Yuan / Z.AI). It introduces a provider abstraction layer so the same set of tools can run against any of three API families:

  • Chat Completions — OpenAI-compatible POST {base}/chat/completions (OpenAI, Z.AI, Zhipu, OpenRouter, Together, Groq, DeepSeek, Moonshot, local Ollama / LM Studio, …)
  • Responses — OpenAI POST {base}/responses (gpt-4o, o-series reasoning models)
  • Anthropic Messages — POST {base}/v1/messages (Claude, and Anthropic-compatible gateways)
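
The three families differ mainly in the path appended to the base URL. A minimal sketch of that mapping follows. The `endpointFor` helper is hypothetical (it is not the project's actual code), and it assumes the Anthropic base URL is given without the `/v1` segment, as the path in the list suggests.

```typescript
// Hypothetical helper: maps an API family to the request URL listed above.
type ProviderFamily = "chat-completions" | "responses" | "anthropic";

function endpointFor(family: ProviderFamily, base: string): string {
  // Drop trailing slashes so ".../v1/" and ".../v1" resolve to the same URL.
  const root = base.replace(/\/+$/, "");
  switch (family) {
    case "chat-completions":
      return `${root}/chat/completions`;
    case "responses":
      return `${root}/responses`;
    case "anthropic":
      // Assumption: `base` does not already end in /v1 for this family.
      return `${root}/v1/messages`;
  }
}

console.log(endpointFor("chat-completions", "https://api.openai.com/v1/"));
// prints "https://api.openai.com/v1/chat/completions"
```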

Quick Start

npx -y @syntx-ai/vlm-mcp-server

That's it for the server side — it speaks MCP over stdio. You need to configure it in your MCP client. Pick your provider and set three environment variables:

| Provider | Environment variables |
| --- | --- |
| Chat Completions | OPENAI_CHAT_COMPLETIONS_API_KEY · OPENAI_CHAT_COMPLETIONS_BASE_URL · OPENAI_CHAT_COMPLETIONS_MODEL |
| Responses | OPENAI_RESPONSES_API_KEY · OPENAI_RESPONSES_BASE_URL · OPENAI_RESPONSES_MODEL |
| Anthropic | OPENAI_ANTHROPIC_API_KEY · OPENAI_ANTHROPIC_BASE_URL · OPENAI_ANTHROPIC_MODEL |

Claude Code one-liner (Chat Completions example — replace with your values):

claude mcp add -s user vlm-mcp-server \
  --env OPENAI_CHAT_COMPLETIONS_API_KEY=sk-... \
       OPENAI_CHAT_COMPLETIONS_BASE_URL=https://api.openai.com/v1/ \
       OPENAI_CHAT_COMPLETIONS_MODEL=gpt-4o \
  -- npx -y @syntx-ai/vlm-mcp-server

For other clients (Cline, OpenCode, Crush, Roo Code, …), see Client Configuration.

Available Tools

Image Analysis

| Tool | Description |
| --- | --- |
| ui_to_artifact | Convert UI screenshots to code, prompts, specs, or descriptions |
| extract_text_from_screenshot | OCR — extract code, terminal output, or text from screenshots |
| diagnose_error_screenshot | Analyze error messages and stack traces, suggest fixes |
| understand_technical_diagram | Analyze architecture, flowchart, UML, ER, and sequence diagrams |
| analyze_data_visualization | Extract insights, trends, and anomalies from charts |
| ui_diff_check | Visual regression — compare expected vs actual UI, prioritize issues |
| analyze_image | General-purpose image analysis (fallback) |

Video Analysis

| Tool | Description |
| --- | --- |
| analyze_video | Video content analysis (local files or URLs, ≤8MB, MP4/MOV/M4V) |
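
The size and format limits for analyze_video can be checked before a file is sent. The sketch below is one possible pre-flight check, not the server's own validation. The `checkVideo` name is hypothetical, the check looks only at the file extension, and reading "8MB" as 8 × 1024 × 1024 bytes is an assumption.

```typescript
// Limits taken from the tool description above.
// Assumption: "8MB" means 8 * 1024 * 1024 bytes.
const MAX_VIDEO_BYTES = 8 * 1024 * 1024;
const VIDEO_EXTENSIONS = ["mp4", "mov", "m4v"];

// Hypothetical pre-flight check: returns an error message,
// or null when the file looks acceptable.
function checkVideo(fileName: string, sizeBytes: number): string | null {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (!VIDEO_EXTENSIONS.includes(ext)) {
    return `unsupported format "${ext}" (expected MP4, MOV, or M4V)`;
  }
  if (sizeBytes > MAX_VIDEO_BYTES) {
    return `file is ${sizeBytes} bytes, above the ${MAX_VIDEO_BYTES}-byte limit`;
  }
  return null;
}

console.log(checkVideo("demo.MOV", 1_000_000)); // prints null
```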

Configuration

The server loads variables from a .env file at startup (real environment variables take precedence). Three layers are supported; precedence is per-provider groups > generic > legacy.

Per-provider groups

Configure each API family independently. auto picks the first group with both a key and a base URL set.

| Variable group | API family |
| --- | --- |
| OPENAI_CHAT_COMPLETIONS_API_KEY / _BASE_URL / _MODEL | Chat Completions |
| OPENAI_RESPONSES_API_KEY / _BASE_URL / _MODEL | Responses |
| OPENAI_ANTHROPIC_API_KEY / _BASE_URL / _MODEL | Anthropic Messages |
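
The rule "first group with both a key and a base URL" can be sketched as a simple scan. This is an illustration under assumptions: `pickGroup` is a hypothetical name, and the scan order (Chat Completions, then Responses, then Anthropic) is assumed from the order of the table, since the text does not state it.

```typescript
type Env = Record<string, string | undefined>;
type ProviderFamily = "chat-completions" | "responses" | "anthropic";

interface GroupConfig {
  family: ProviderFamily;
  apiKey: string;
  baseUrl: string;
  model: string | undefined;
}

// Assumed scan order: the order of the table above.
const GROUPS: { prefix: string; family: ProviderFamily }[] = [
  { prefix: "OPENAI_CHAT_COMPLETIONS", family: "chat-completions" },
  { prefix: "OPENAI_RESPONSES", family: "responses" },
  { prefix: "OPENAI_ANTHROPIC", family: "anthropic" },
];

// Hypothetical helper: returns the first group that has both a key and a base URL.
function pickGroup(env: Env): GroupConfig | null {
  for (const { prefix, family } of GROUPS) {
    const apiKey = env[`${prefix}_API_KEY`];
    const baseUrl = env[`${prefix}_BASE_URL`];
    if (apiKey && baseUrl) {
      return { family, apiKey, baseUrl, model: env[`${prefix}_MODEL`] };
    }
  }
  // No complete group: a caller would fall back to the generic VLM_* variables.
  return null;
}
```

In a Node.js process the function would typically be called as `pickGroup(process.env)`. A group with a key but no base URL is skipped rather than treated as an error.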

Generic variables

| Variable | Description | Default |
| --- | --- | --- |
| VLM_API_KEY | API key | (required) |
| VLM_BASE_URL | Provider API root | Zhipu default |
| VLM_VISION_MODEL | Model name | glm-4.6v |
| VLM_PROVIDER | Provider family: auto / chat-completions / responses / anthropic | auto |
| VLM_VISION_MODEL_TEMPERATURE | Sampling temperature | 0.8 |
| VLM_VISION_MODEL_TOP_P | Top-p | 0.6 |
| VLM_VISION_MODEL_MAX_TOKENS | Max output tokens | 32768 |
| VLM_TIMEOUT | Request timeout (ms) | 300000 |
| VLM_RETRY_COUNT | Retry attempts | 1 |
| VLM_ENABLE_THINKING | Enable provider-specific reasoning / thinking request fields. Off by default for broad OpenAI-compatible Chat Completions support. | false |
| VLM_ANTHROPIC_VERSION | anthropic-version header (Anthropic only) | 2023-06-01 |
| VLM_LOG_PATH | Custom log file path | ~/.vlm/vlm-mcp-YYYY-MM-DD.log |
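
To make the defaults concrete, here is a rough sketch of reading the generic variables into a typed object. The names `numberFrom` and `loadGenericConfig` are hypothetical. Falling back to the default on an unparsable value, and treating only the literal string `true` as enabling thinking, are assumptions; the server may parse these differently.

```typescript
type Env = Record<string, string | undefined>;

// Parse a numeric variable; use the documented default when unset or invalid.
// Assumption: invalid values fall back silently rather than raising an error.
function numberFrom(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isFinite(value) ? value : fallback;
}

// Hypothetical loader covering the variables whose defaults are listed above.
function loadGenericConfig(env: Env) {
  return {
    model: env["VLM_VISION_MODEL"] ?? "glm-4.6v",
    provider: env["VLM_PROVIDER"] ?? "auto",
    temperature: numberFrom(env, "VLM_VISION_MODEL_TEMPERATURE", 0.8),
    topP: numberFrom(env, "VLM_VISION_MODEL_TOP_P", 0.6),
    maxTokens: numberFrom(env, "VLM_VISION_MODEL_MAX_TOKENS", 32768),
    timeoutMs: numberFrom(env, "VLM_TIMEOUT", 300000),
    retryCount: numberFrom(env, "VLM_RETRY_COUNT", 1),
    // Assumption: only the exact string "true" switches this on.
    enableThinking: env["VLM_ENABLE_THINKING"] === "true",
    anthropicVersion: env["VLM_ANTHROPIC_VERSION"] ?? "2023-06-01",
  };
}

console.log(loadGenericConfig({ VLM_TIMEOUT: "60000" }).timeoutMs); // prints 60000
```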

Provider auto-detection

In auto mode (when no OPENAI_* group is set), the provider is inferred as follows:

  • Base URL contains anthropic, or key starts with sk-ant → anthropic
  • Otherwise → chat-completions (the most broadly compatible default)
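
The two inference rules (an Anthropic-looking base URL or an `sk-ant` key prefix selects anthropic, everything else selects chat-completions) amount to a short function. The sketch below is illustrative only: `inferProvider` is a hypothetical name, and matching the base URL case-insensitively is an assumption.

```typescript
type ProviderFamily = "chat-completions" | "responses" | "anthropic";

// Hypothetical helper implementing the two rules listed above.
function inferProvider(baseUrl: string, apiKey: string): ProviderFamily {
  // Assumption: the "contains anthropic" test ignores letter case.
  if (baseUrl.toLowerCase().includes("anthropic") || apiKey.startsWith("sk-ant")) {
    return "anthropic";
  }
  // The most broadly compatible default.
  return "chat-completions";
}

console.log(inferProvider("https://api.anthropic.com", "key")); // prints "anthropic"
console.log(inferProvider("https://api.openai.com/v1/", "sk-123")); // prints "chat-completions"
```

Under the rules as listed, responses is never inferred, so it would have to be selected explicitly through VLM_PROVIDER or the OPENAI_RESPONSES_* group.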

Usage Examples

Once the server is installed in your client, you can use it through conversation. For example, in Claude Code, type describe this demo.png — the MCP Server will process the image and return a description (the image must exist in the current directory).

Outside Claude Code, pasting an image directly into the client will NOT invoke this MCP Server — the client encodes the image and calls the model API itself. Best practice: place images in a local directory and refer to them by name or path in conversation, e.g. What does demo.png describe?

Troubleshooting

Run the server directly from the command line to verify it starts, isolating environment / permission issues:

# Linux / macOS
OPENAI_CHAT_COMPLETIONS_API_KEY=sk-... \
OPENAI_CHAT_COMPLETIONS_BASE_URL=https://api.openai.com/v1/ \
OPENAI_CHAT_COMPLETIONS_MODEL=gpt-4o \
npx -y @syntx-ai/vlm-mcp-server

# Windows CMD
set OPENAI_CHAT_COMPLETIONS_API_KEY=sk-... && set OPENAI_CHAT_COMPLETIONS_BASE_URL=https://api.openai.com/v1/ && set OPENAI_CHAT_COMPLETIONS_MODEL=gpt-4o && npx -y @syntx-ai/vlm-mcp-server

# Windows PowerShell
$env:OPENAI_CHAT_COMPLETIONS_API_KEY="sk-..."; $env:OPENAI_CHAT_COMPLETIONS_BASE_URL="https://api.openai.com/v1/"; $env:OPENAI_CHAT_COMPLETIONS_MODEL="gpt-4o"; npx -y @syntx-ai/vlm-mcp-server
  • If it starts successfully, the environment is correct — the issue is likely in the client's MCP config; double-check it.
  • If it fails, investigate the error message (pasting it to an LLM for analysis is recommended).

Common issues

Connection failure

  1. Ensure Node.js 18 or newer is installed.
  2. Run node -v and npx -v to confirm the runtime is available.
  3. Verify the environment variables (OPENAI_* triple or VLM_*) are set correctly.

Invalid API Key

  1. Confirm the API Key was copied correctly.
  2. Check that the API Key is activated.
  3. Ensure the selected provider family matches the API Key (Chat Completions / Responses / Anthropic).
  4. Check that the API Key has sufficient balance.

Connection timeout

  1. Check your network connection.
  2. Check firewall settings.
  3. Try switching to a different provider family or base URL.
  4. Increase the timeout (VLM_TIMEOUT, default 300000ms).
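
For intuition about how VLM_TIMEOUT and VLM_RETRY_COUNT interact, the following is a generic sketch of a per-attempt timeout combined with retries. It is not the server's retry wrapper. `withTimeout` and `withRetry` are hypothetical names, and reading the retry count as "extra attempts after the first failure" is an assumption.

```typescript
// Reject if `work` does not settle within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Assumption: retryCount counts extra attempts after the first failure,
// so retryCount = 1 allows at most two attempts in total.
async function withRetry<T>(
  attempt: () => Promise<T>,
  retryCount: number,
  timeoutMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= retryCount; i++) {
    try {
      return await withTimeout(attempt(), timeoutMs);
    } catch (error) {
      lastError = error; // remember the failure and try again if attempts remain
    }
  }
  throw lastError;
}
```

Under this reading, the worst-case wait before a request finally fails is roughly (1 + retry count) × timeout, which is about ten minutes with the documented defaults. The sketch only stops waiting and does not cancel the underlying request.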

Architecture

src/
├── index.ts                  # Entry point: starts the MCP server, registers all tools
├── types/                    # Error types (McpError, ApiError, ValidationError, …)
├── core/
│   ├── environment.ts        # Env config (VLM_* + OPENAI_* groups), URL resolution
│   ├── chat-service.ts       # Delegates to the active VisionProvider
│   ├── file-service.ts       # File validation + base64 encoding (image/video)
│   ├── base-image-service.ts # Shared image-processing logic for all image tools
│   ├── api-common.ts         # Message builders, response helpers, retry wrapper
│   ├── error-handler.ts      # Error hierarchy + handling/recovery strategies
│   └── logger.ts             # stderr + file logger (keeps stdout JSON-clean)
├── providers/                # Pluggable model-provider abstraction
│   ├── types.ts              # VisionProvider interface, ChatMessage, postJson helper
│   ├── chat-completions.ts   # OpenAI-compatible Chat Completions
│   ├── responses.ts          # OpenAI Responses API
│   ├── anthropic.ts          # Anthropic Messages API
│   └── index.ts              # Provider selection (VLM_PROVIDER / auto-infer)
├── prompts/                  # System prompts for each specialized tool
└── tools/                    # 8 tool registrations (7 image + 1 video)

The provider layer (src/providers/) is the key extension. Each provider implements a VisionProvider interface that takes normalized ChatMessage[] (the OpenAI Chat Completions content-part format as internal lingua franca) and translates it to the provider's wire format. chat-service.ts simply delegates to the resolved provider, so none of the tool code needed to change.
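
To illustrate the translation step, here is a simplified sketch of a provider that converts Chat Completions content parts into Anthropic Messages content blocks. The interface shape, the `buildRequestBody` method, and `toAnthropicBlock` are hypothetical. The real VisionProvider in src/providers/types.ts may look quite different.

```typescript
// Internal format: OpenAI Chat Completions content parts.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string | ContentPart[];
}

// Hypothetical interface shape, for illustration only.
interface VisionProvider {
  readonly name: string;
  buildRequestBody(messages: ChatMessage[], model: string): unknown;
}

type AnthropicBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } }
  | { type: "image"; source: { type: "url"; url: string } };

function toAnthropicBlock(part: ContentPart): AnthropicBlock {
  if (part.type === "text") return { type: "text", text: part.text };
  // data: URLs become base64 sources; anything else is passed through as a URL source.
  const match = /^data:([^;,]+);base64,(.*)$/.exec(part.image_url.url);
  if (match) {
    return { type: "image", source: { type: "base64", media_type: match[1], data: match[2] } };
  }
  return { type: "image", source: { type: "url", url: part.image_url.url } };
}

const anthropicProvider: VisionProvider = {
  name: "anthropic",
  buildRequestBody(messages, model) {
    // Anthropic takes the system prompt as a top-level field, not as a message.
    // Simplification: only string system prompts are carried over.
    const system = messages
      .filter((m) => m.role === "system")
      .map((m) => (typeof m.content === "string" ? m.content : ""))
      .join("\n");
    const rest = messages
      .filter((m) => m.role !== "system")
      .map((m) => ({
        role: m.role,
        content:
          typeof m.content === "string"
            ? [{ type: "text", text: m.content } as AnthropicBlock]
            : m.content.map(toAnthropicBlock),
      }));
    // max_tokens uses the documented VLM_VISION_MODEL_MAX_TOKENS default.
    return { model, max_tokens: 32768, system, messages: rest };
  },
};
```

Because tools only ever build ChatMessage[] values, adding a new API family in this style means writing one more translator and leaving the tool code untouched.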

License

Apache-2.0
