web-perception-mcp
A DOM-first, vision-second MCP server for webpage analysis, combining structural DOM extraction with MiniMax vision to provide accurate, context-rich page understanding for AI coding assistants.
web-perception MCP Server
A Model Context Protocol (MCP) server for DOM-first, vision-second webpage analysis. Built for AI coding assistants like Cline.
What it does
This server lets AI agents understand webpages by combining structural DOM data with visual analysis from MiniMax vision. Instead of sending only a raw screenshot to the vision model, it first extracts the page's heading hierarchy, interactive elements, landmarks, and layout, then sends both the screenshot and that structural context to MiniMax so the model can ground what it sees in the page's actual structure.
Tools
| Tool | Description | Cost |
|---|---|---|
| inspect_page | Extract page structure, metadata, and content. Basic mode (HTTP fetch) or full mode (headless browser). | Free (no vision API call) |
| analyze_image | Analyse local image files (screenshots, mockups, designs) using MiniMax vision. | Vision API call |
| analyze_page_visual | End-to-end: browser → DOM extraction → screenshot → enriched prompt → MiniMax vision → structured findings with element refs. | Vision API call |
| extract_page_data | Extract structured data from a page matching a provided schema. Progressive escalation: DOM → browser → vision. | Vision API call only if DOM insufficient |
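
The progressive escalation used by extract_page_data can be pictured as trying the cheaper extractors first and stopping as soon as every requested field is filled. The sketch below is an illustration under assumptions, not the server's actual code: the `extractWithEscalation` function, the stage objects, and the flat list of field names standing in for a schema are all hypothetical.

```javascript
// Hypothetical sketch of DOM -> browser -> vision escalation.
// Each stage is { name, run }, where run(url, missingFields) resolves to a
// partial result object. None of these names come from the real server.
async function extractWithEscalation(url, fields, stages) {
  const data = {};
  const stagesUsed = [];
  for (const { name, run } of stages) {
    const missing = fields.filter((field) => data[field] == null);
    if (missing.length === 0) break; // everything found, skip costlier stages
    const partial = await run(url, missing);
    stagesUsed.push(name);
    for (const field of missing) {
      if (partial[field] != null) data[field] = partial[field];
    }
  }
  return { data, stagesUsed };
}

// Stub stages standing in for the real extractors (assumed, for illustration).
const stages = [
  { name: "dom", run: async () => ({ title: "Example Domain" }) },
  { name: "browser", run: async () => ({ price: "9.99" }) },
  { name: "vision", run: async () => ({ rating: "4.5" }) },
];
```

With these stubs, asking for `["title", "price"]` stops after the browser stage, so the vision stage (the only one that would cost an API call) never runs.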
Prerequisites
- Node.js 18+
- A nanoGPT account with an API key
- An MCP-compatible client like Cline, Cursor, or Claude Desktop
Installation
    git clone https://github.com/JaviGala/web-perception-mcp.git
    cd web-perception-mcp
    npm install
    cp .env.example .env   # add your NANOGPT_API_KEY
Configuration
Set via .env file or environment variables:
| Variable | Default | Description |
|---|---|---|
| NANOGPT_API_KEY | (required) | Your nanoGPT API key |
| NANOGPT_BASE_URL | https://nano-gpt.com/api/subscription/v1 | API endpoint. Use /api/v1 for pay-as-you-go. |
| NANOGPT_MODEL | minimax/minimax-m3 | Vision model to use |
| NANOGPT_TEMPERATURE | 0.3 | Default temperature |
| NANOGPT_MAX_TOKENS | 2000 | Default max tokens |
| REQUEST_TIMEOUT_MS | 30000 | HTTP request timeout |
| MAX_TEXT_LENGTH | 50000 | Max characters in extracted text |
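
As a rough picture of how these variables and defaults might be resolved at startup, here is a minimal sketch. The `loadConfig` function and the camelCase property names are assumptions made for illustration; the server's real loading code may differ.

```javascript
// Hypothetical config loader using the variables and defaults listed above.
// This is a sketch, not the server's actual implementation.
const DEFAULTS = {
  NANOGPT_BASE_URL: "https://nano-gpt.com/api/subscription/v1",
  NANOGPT_MODEL: "minimax/minimax-m3",
  NANOGPT_TEMPERATURE: "0.3",
  NANOGPT_MAX_TOKENS: "2000",
  REQUEST_TIMEOUT_MS: "30000",
  MAX_TEXT_LENGTH: "50000",
};

function loadConfig(env = process.env) {
  // The API key has no default, so fail early when it is absent.
  if (!env.NANOGPT_API_KEY) {
    throw new Error("NANOGPT_API_KEY is required");
  }
  // Fall back to the default only when the variable is undefined or null.
  const get = (key) => env[key] ?? DEFAULTS[key];
  return {
    apiKey: env.NANOGPT_API_KEY,
    baseUrl: get("NANOGPT_BASE_URL"),
    model: get("NANOGPT_MODEL"),
    temperature: Number(get("NANOGPT_TEMPERATURE")),
    maxTokens: Number(get("NANOGPT_MAX_TOKENS")),
    requestTimeoutMs: Number(get("REQUEST_TIMEOUT_MS")),
    maxTextLength: Number(get("MAX_TEXT_LENGTH")),
  };
}
```

Passing the environment in as an argument keeps the function easy to test: `loadConfig({ NANOGPT_API_KEY: "k", NANOGPT_MAX_TOKENS: "500" })` overrides one value and leaves the rest at their defaults.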
MCP Client Configuration
Cline (VS Code)
    {
      "mcpServers": {
        "web-perception": {
          "type": "stdio",
          "command": "/usr/local/bin/node",
          "timeout": 120,
          "args": ["/path/to/web-perception-mcp/src/server.js"],
          "env": {
            "NANOGPT_API_KEY": "sk-your-key-here"
          }
        }
      }
    }
How analyze_page_visual works
analyze_page_visual is the server's core differentiator. Its workflow runs in this order:
1. Launch a headless browser (Playwright + Chromium, no visible window).
2. Navigate to the target URL and wait for the page to load.
3. Extract the DOM structure: heading hierarchy, interactive elements with bounding boxes, landmarks, and visible text.
4. Take a screenshot (viewport or full page).
5. Build the enriched prompt, which combines the DOM structure with the user's question.
6. Send to MiniMax: the vision model receives both the screenshot image and the structural context.
7. Parse the response into structured JSON findings with element refs, severity, and confidence scores.
This gives MiniMax more to work with than a raw screenshot alone: it can correlate visual elements with their DOM positions, roles, and content.
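The prompt-building and response-parsing steps of this workflow can be sketched as two small pure functions. Everything here is an assumption for illustration: the shape of the `dom` object, the `e1`, `e2` element ref format, the prompt wording, and the function names `buildEnrichedPrompt` and `parseFindings` are hypothetical and not taken from vision.js.

```javascript
// Hypothetical sketch. Assumes `dom` looks like:
// { headings: [{ level, text }], interactive: [{ role, name, box }], landmarks: [string] }
function buildEnrichedPrompt(dom, question) {
  // Indent headings by level so the hierarchy is visible in plain text.
  const headings = dom.headings.map(
    (h) => `${"  ".repeat(h.level - 1)}h${h.level}: ${h.text}`
  );
  // Give each interactive element a short ref the model can cite back.
  const interactive = dom.interactive.map(
    (el, i) =>
      `[e${i + 1}] ${el.role} "${el.name}" at (${el.box.x}, ${el.box.y}) ${el.box.width}x${el.box.height}`
  );
  return [
    "You are analysing a webpage screenshot. Use the DOM context below.",
    "## Headings",
    ...headings,
    "## Interactive elements",
    ...interactive,
    "## Landmarks",
    ...dom.landmarks.map((l) => `- ${l}`),
    "## Question",
    question,
    "Reply with a JSON array of findings: ref, severity, confidence, note.",
  ].join("\n");
}

// Parse the text between the first "[" and the last "]" as a JSON array.
// Returns an empty list if the reply holds no parseable array.
function parseFindings(reply) {
  const start = reply.indexOf("[");
  const end = reply.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  try {
    const parsed = JSON.parse(reply.slice(start, end + 1));
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return [];
  }
}
```

The parser is deliberately forgiving about prose around the JSON, but it is naive: a reply that mentions a bracketed ref such as `[e1]` before the array would fail to parse and yield an empty list, so a production parser would need something sturdier.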
Architecture
    src/
      server.js      MCP server, tool definitions, request routing
      vision.js      MiniMax API client, prompt builder, result parser
      browser.js     Playwright browser pool management
      extraction.js  DOM extraction (static + browser-rendered)
      security.js    URL validation, logging, timeout utilities
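To give a feel for the kind of helpers a module like security.js might hold, here is a minimal sketch of URL validation and a timeout wrapper. The names `validateUrl` and `withTimeout`, and the choice to block loopback hosts, are assumptions for illustration; the README does not say which checks the real module performs.

```javascript
// Hypothetical URL check: allow only http(s) and reject obvious loopback hosts.
// The real module's rules are not documented here; this is an assumed policy.
function validateUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return { ok: false, reason: "not a valid URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: `unsupported protocol ${url.protocol}` };
  }
  const host = url.hostname;
  if (host === "localhost" || host === "127.0.0.1" || host === "[::1]") {
    return { ok: false, reason: "loopback hosts are blocked" };
  }
  return { ok: true, url };
}

// Reject if `promise` does not settle within `ms` milliseconds.
// The timer is cleared either way so it cannot keep the process alive.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A caller could combine the two, for example by validating a URL first and then wrapping the page fetch in `withTimeout(fetchPage(url), 30000)`, where `fetchPage` is again a placeholder name and 30000 matches the documented REQUEST_TIMEOUT_MS default.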
Evolution
This project was built as a replacement for nanogpt-vision-mcp, which only supported raw image analysis without DOM context. The MiniMax client code was carried forward with no functional changes. See the nanogpt-vision-mcp README for the full migration story.
License
MIT