vision-reader
Enables reading images (diagrams, screenshots) directly via the model's own vision, with no external API key needed, and can extract embedded images from .doc/MHTML documents.
Give Kiro Eyes: Reading Diagrams — Even the Ones Buried in Documents
Kiro is great at reading code, configs, and docs. But hand it a PNG architecture diagram and the file tools shrug:
Caught error reading: ... File seems to be binary and cannot be opened as text
The file-reading tools treat everything as text, so a binary image just bounces
off. It gets worse: a huge amount of architecture knowledge doesn't even live in
loose .png files — it's embedded inside documents. Word and Confluence
"Export to Word" produce a single MHTML file (often with a .doc extension),
and the diagrams are buried inside that envelope. There's no image on disk to
point at, so neither the file tools nor a vision tool can see them.
This post builds a small two-part toolkit that closes both gaps:
- An extractor that pulls embedded images out of .doc/MHTML documents into real image files, organized by the document's section structure.
- A tiny Model Context Protocol (MCP) server that hands those images straight to Kiro so it can read them with its own vision: no external vision API, no API key, no per-image cost.
By the end, you can take a folder of exported design docs and ask Kiro to "summarize the architecture section by section," and it will actually see every diagram.
The key insight
There are two ways to make an agent "read" an image:
1. Call an external vision API (OpenAI, Anthropic, Google) inside the MCP server, get back a text description, and hand that text to Kiro. This works, but it needs an API key, costs money per image, and Kiro only ever sees someone else's description, not the image itself.
2. Hand the raw image to Kiro directly. Kiro is already a multimodal model. MCP has a first-class ImageContent type for exactly this. If the server reads the file, base64-encodes it, and returns ImageContent, Kiro looks at the actual pixels with its own vision.
Option 2 is simpler, free, and higher fidelity. That's what we'll build — and we'll feed it from an extractor that frees diagrams trapped inside documents.
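To make option 2 concrete, here is a minimal stdlib-only sketch of what "base64-encode it and return it" amounts to. The dict mirrors the three fields (type, data, mimeType) that the server in Step 3 passes to ImageContent; the helper name image_payload and the placeholder bytes are illustrative assumptions, not code from this repo.

```python
import base64


def image_payload(raw: bytes, mime_type: str) -> dict:
    """Shape raw image bytes like an MCP image content item (illustrative)."""
    return {
        "type": "image",
        "data": base64.standard_b64encode(raw).decode("ascii"),
        "mimeType": mime_type,
    }


# Placeholder bytes: a PNG signature followed by filler, not a real image.
raw = b"\x89PNG\r\n\x1a\n" + b"placeholder bytes"
payload = image_payload(raw, "image/png")

# Decoding the payload gives back exactly the bytes that were read.
roundtrip = base64.standard_b64decode(payload["data"])
```

Nothing is summarized or described along the way: the receiver gets the same bytes that were on disk.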
The full pipeline
                  step 1: extract                   step 2: read
┌──────────────┐   (stdlib only)   ┌──────────────┐   tool call   ┌──────────┐
│ *.doc /      │ ────────────────► │ image files  │ ────────────► │ Kiro     │
│ MHTML docs   │   extract_doc_    │ on disk      │ read_image /  │ (model)  │
│ (diagrams    │   images.py       │ (organized   │ read_all_     │          │
│ embedded)    │                   │ by section)  │ images        │          │
└──────────────┘                   └──────────────┘               └────┬─────┘
                                          ▲                            │
                                          │   ImageContent (base64)    │
                                          └────────────────────────────┘
                                                        ▼
                                          Kiro "sees" each diagram with
                                          its own vision and explains it
Two cooperating pieces:
- extract_doc_images.py: turns "diagrams locked inside a document" into "image files on disk," mirroring the document's heading hierarchy so each diagram keeps its section context.
- vision_server.py: an MCP server with read_image and read_all_images tools that return ImageContent. Kiro does the actual "looking."
If your diagrams are already loose .png/.jpg files, you can skip step 1 and
go straight to the MCP server. But for design docs exported from a wiki, step 1
is what makes them readable at all.
Step 1 — Extract images from documents
Many documentation systems export a page as a single MHTML file with a .doc
extension. Inside that envelope the diagrams are real binary images (PNG, JPG,
etc.), but they're attached as MIME parts, not saved as files. extract_doc_images.py
parses the envelope (using Python's built-in email module — no third-party
deps), pulls every embedded image out, and writes it to disk.
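The extractor script itself is not reproduced in this post, so the following is only a rough sketch of the MIME-parsing idea, assuming the export is a well-formed multipart message. The function name embedded_images and the tiny SAMPLE envelope are hypothetical; the real script likely handles more edge cases.

```python
import base64
import email

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # stands in for a real image in this demo


def embedded_images(mhtml_bytes: bytes) -> list[tuple[str, bytes]]:
    """Return (content_type, decoded_bytes) for every image/* MIME part."""
    message = email.message_from_bytes(mhtml_bytes)
    found = []
    for part in message.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes the base64 Content-Transfer-Encoding.
            found.append((part.get_content_type(), part.get_payload(decode=True)))
    return found


# A hand-built, two-part envelope: one HTML body and one attached image.
SAMPLE = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/related; boundary="demo-boundary"',
    b"",
    b"--demo-boundary",
    b"Content-Type: text/html; charset=utf-8",
    b"",
    b"<html><body><h1>1. Overview</h1><img src='image001.png'></body></html>",
    b"--demo-boundary",
    b"Content-Type: image/png",
    b"Content-Transfer-Encoding: base64",
    b"Content-Location: image001.png",
    b"",
    base64.b64encode(PNG_SIGNATURE),
    b"--demo-boundary--",
    b"",
])

images = embedded_images(SAMPLE)
```

Each tuple holds the declared content type and the decoded bytes, which is everything needed to write the image to disk.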
Crucially, it walks the document's headings (h1 > h2 > h3 ...) as it goes and
drops each image into the folder of the deepest section that owns it. So an
image under "2. Solution > 2.1 Network" lands in
.../2. Solution/2.1 Network/. That folder structure is gold later: the names
tell you — and the model — exactly which section each diagram belongs to.
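One way to get that "deepest owning section" behavior is to keep a stack of open headings while scanning the HTML part. This is a sketch using the standard library's html.parser; the class name SectionTracker is hypothetical, and the repo's script may track sections differently.

```python
from html.parser import HTMLParser


class SectionTracker(HTMLParser):
    """Track the current heading path and record where each <img> appears."""

    def __init__(self):
        super().__init__()
        self.path = []      # open sections as (heading level, title)
        self.images = []    # (img src, tuple of section titles)
        self._level = None  # level of the heading currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level, self._text = int(tag[1]), []
        elif tag == "img":
            src = dict(attrs).get("src") or ""
            self.images.append((src, tuple(title for _, title in self.path)))

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # A new heading closes every open section at the same or deeper level.
            while self.path and self.path[-1][0] >= self._level:
                self.path.pop()
            self.path.append((self._level, "".join(self._text).strip()))
            self._level = None


tracker = SectionTracker()
tracker.feed(
    '<h1>2. Solution</h1><p>intro</p><img src="a.png">'
    '<h2>2.1 Network</h2><img src="b.png">'
    '<h1>3. Ops</h1><img src="c.png">'
)
```

Here b.png is recorded under ("2. Solution", "2.1 Network"), which maps directly onto a nested folder path.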
python extract_doc_images.py ./docs
# -> writes images to ./docs/extracted_images/<doc-name>/<section>/...
You'll get a short report like:
[OK] design-overview.doc: extracted 12 embedded image(s), 0 external reference(s) skipped
[OK] network-flows.doc: extracted 8 embedded image(s), 1 external reference(s) skipped
Images written to: ./docs/extracted_images
The script auto-detects PNG/JPG/GIF/BMP/WEBP/SVG by magic bytes, sanitizes section names into valid folder names, and skips external (non-embedded) image references.
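As an illustration of those two helpers, here is a small sketch of magic-byte sniffing and folder-name sanitizing. The function names, the signature table, and the exact set of replaced characters are assumptions for this example rather than the script's actual rules.

```python
import re
from typing import Optional

# Leading bytes that identify common raster formats.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
]


def sniff_extension(data: bytes) -> Optional[str]:
    """Guess a file extension from the first bytes, or None if unknown."""
    for magic, ext in MAGIC:
        if data.startswith(magic):
            return ext
    # WEBP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    # SVG is text, so look for the root tag near the start (a heuristic).
    if b"<svg" in data[:1024].lower():
        return ".svg"
    return None


def safe_folder_name(title: str, max_len: int = 80) -> str:
    """Turn a section title into a name that is valid on common filesystems."""
    cleaned = re.sub(r"\s+", " ", title).strip()
    # Replace path separators and characters Windows rejects in names.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", cleaned)
    # Windows also rejects names that end in a dot or a space.
    return cleaned[:max_len].rstrip(". ") or "untitled"


folder = safe_folder_name("2.1 Network: VPC/Subnets")
```

Sniffing the bytes instead of trusting the declared MIME type matters because exported documents do not always label their parts accurately.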
Keep the extracted folder out of version control if the documents are internal: the diagrams and their folder names can reveal sensitive detail. The included .gitignore already ignores extracted_images/.
Step 2 — Install the MCP server's dependencies
The vision server needs only the MCP SDK and Pillow (for resizing / format conversion):
pip install "mcp>=1.0.0" "Pillow>=10.0.0"
No ANTHROPIC_API_KEY, no OPENAI_API_KEY. There is no external API call.
Step 3 — The vision MCP server
Save this as vision_server.py. It reads an image, downscales it if needed, and
returns ImageContent so Kiro sees the pixels directly:
"""
MCP Server: Vision Reader (native model vision)

Reads an image file, base64-encodes it, and returns ImageContent so the
host model (Kiro) can look at it directly. No external API key required.
"""
import base64
import io
from pathlib import Path

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import TextContent, ImageContent, Tool

app = Server("vision-reader")

SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}
# Used only by the no-Pillow fallback, which sends the file bytes unchanged,
# so each entry must describe the file's real format.
MEDIA_TYPE = {
    ".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    ".gif": "image/gif", ".webp": "image/webp", ".bmp": "image/bmp",
}
MAX_DIMENSION = 1568          # longest edge (px) recommended for vision
MAX_BASE64_BYTES = 4_500_000  # ~4.5 MB after base64 encoding


def resolve_path(file_path: str) -> Path:
    p = Path(file_path)
    return p if p.is_absolute() else Path.cwd() / p

def image_to_base64(path: Path) -> tuple[str, str]:
    """Read an image, normalize/shrink it, return (base64, media_type)."""
    ext = path.suffix.lower()
    try:
        from PIL import Image
    except ImportError:
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode(), MEDIA_TYPE.get(ext, "image/png")

    img = Image.open(path)
    if img.mode in ("RGBA", "LA", "P"):
        img = img.convert("RGBA") if "A" in img.mode else img.convert("RGB")

    # Downscale if the longest edge is too large.
    longest = max(img.size)
    if longest > MAX_DIMENSION:
        scale = MAX_DIMENSION / longest
        img = img.resize((max(1, int(img.size[0] * scale)),
                          max(1, int(img.size[1] * scale))), Image.LANCZOS)

    # Prefer PNG (keeps diagram text crisp).
    buf = io.BytesIO()
    (img.convert("RGB") if img.mode == "RGBA" else img).save(buf, format="PNG", optimize=True)
    data = base64.standard_b64encode(buf.getvalue()).decode()
    if len(data) <= MAX_BASE64_BYTES:
        return data, "image/png"

    # Too big -> fall back to JPEG with decreasing quality.
    rgb = img.convert("RGB")
    for quality in (90, 80, 70, 60, 50):
        buf = io.BytesIO()
        rgb.save(buf, format="JPEG", quality=quality, optimize=True)
        data = base64.standard_b64encode(buf.getvalue()).decode()
        if len(data) <= MAX_BASE64_BYTES:
            return data, "image/jpeg"
    return data, "image/jpeg"

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="read_image",
            description="Read an image file (PNG/JPG/JPEG/WEBP/GIF/BMP) and return "
                        "it for the model to analyze with vision. Great for "
                        "architecture diagrams, flowcharts, and screenshots.",
            inputSchema={
                "type": "object",
                "properties": {
                    "file_path": {"type": "string",
                                  "description": "Relative or absolute path to the image."},
                    "question": {"type": "string", "default": "",
                                 "description": "Optional question to guide analysis."},
                },
                "required": ["file_path"],
            },
        ),
        Tool(
            name="read_all_images",
            description="Read every image in a folder (optionally recursive) and "
                        "return them for the model to analyze. Pair this with the "
                        "doc extractor to read whole design docs at once.",
            inputSchema={
                "type": "object",
                "properties": {
                    "folder_path": {"type": "string", "default": "."},
                    "question": {"type": "string", "default": ""},
                    "recursive": {"type": "boolean", "default": False},
                    "max_images": {"type": "integer", "default": 20},
                },
                "required": [],
            },
        ),
    ]

def _image_content(path: Path) -> ImageContent:
    data, media_type = image_to_base64(path)
    return ImageContent(type="image", data=data, mimeType=media_type)


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list:
    if name == "read_image":
        path = resolve_path(arguments.get("file_path", ""))
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            return [TextContent(type="text", text=f"Cannot read image: {path}")]
        header = f"Image: {path.name}"
        if arguments.get("question"):
            header += f"\nQuestion: {arguments['question']}"
        return [TextContent(type="text", text=header), _image_content(path)]

    if name == "read_all_images":
        folder = resolve_path(arguments.get("folder_path", "."))
        if not folder.is_dir():
            return [TextContent(type="text", text=f"Not a folder: {folder}")]
        pattern = "**/*" if arguments.get("recursive") else "*"
        images = sorted(f for f in folder.glob(pattern)
                        if f.is_file() and f.suffix.lower() in SUPPORTED)
        images = images[: int(arguments.get("max_images", 20))]
        if not images:
            return [TextContent(type="text", text=f"No images in: {folder}")]
        out: list = [TextContent(type="text", text=f"Found {len(images)} image(s).")]
        for img in images:
            # Label each image with its path relative to the folder so the
            # section-named subfolders from the extractor reach the model.
            label = img.relative_to(folder).as_posix()
            out.append(TextContent(type="text", text=f"--- {label} ---"))
            out.append(_image_content(img))
        return out

    return [TextContent(type="text", text=f"Unknown tool: {name}")]


async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
The full version in this repo also includes per-file error handling and a
max_images cap; the snippet above is the heart of it.
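The two limits at the top of the server are easier to tune once the arithmetic is explicit. The sketch below restates the server's downscale rule as a pure function and adds the standard base64 size relation (4 output characters for every 3 input bytes); the helper names are made up for this example.

```python
import math

MAX_DIMENSION = 1568
MAX_BASE64_BYTES = 4_500_000


def scaled_size(width: int, height: int, max_dim: int = MAX_DIMENSION) -> tuple[int, int]:
    """Apply the same rule as the server: shrink so the longest edge fits."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / longest
    return max(1, int(width * scale)), max(1, int(height * scale))


def base64_length(raw_bytes: int) -> int:
    """Length of padded standard base64 for a given number of input bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def max_raw_bytes(limit: int = MAX_BASE64_BYTES) -> int:
    """Largest encoded-image size (in bytes) whose base64 form fits the limit."""
    return (limit // 4) * 3


wide = scaled_size(3136, 1000)   # longest edge is exactly 2x the limit
small = scaled_size(800, 600)    # already within the limit, left unchanged
budget = max_raw_bytes()
```

With the defaults, the 4.5 MB base64 cap corresponds to roughly 3.4 MB of PNG or JPEG bytes, which is the number to compare against when a dense diagram keeps falling back to JPEG.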
Step 4 — Register the server in Kiro
Kiro reads MCP config from .kiro/settings/mcp.json (workspace-level) or
~/.kiro/settings/mcp.json (user-level). Add the server:
{
  "mcpServers": {
    "vision-reader": {
      "command": "python",
      "args": ["/absolute/path/to/vision_server.py"],
      "disabled": false,
      "autoApprove": ["read_image", "read_all_images"]
    }
  }
}
Use the absolute path to your vision_server.py. On Windows, escape the
backslashes (C:\\path\\to\\vision_server.py) or use forward slashes.
Kiro reconnects to MCP servers automatically when the config changes, or you can reconnect from the MCP Server view in the Kiro feature panel.
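Two mistakes are easy to make in this file: a relative script path, and unescaped Windows backslashes, which make the JSON itself invalid. A small checker can catch both before you go hunting in logs. This is a hypothetical helper (check_mcp_config is not part of Kiro or the MCP SDK), and it only looks at the fields shown in the config above.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def check_mcp_config(text: str, server: str = "vision-reader") -> list[str]:
    """Return a list of human-readable problems found in an mcp.json string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]

    entry = config.get("mcpServers", {}).get(server)
    if entry is None:
        return [f"no '{server}' entry under mcpServers"]

    problems = []
    if not entry.get("command"):
        problems.append("missing 'command'")
    args = entry.get("args") or []
    script = args[0] if args else ""
    # Accept either a POSIX absolute path or a Windows drive-rooted path.
    if not (PurePosixPath(script).is_absolute() or PureWindowsPath(script).is_absolute()):
        problems.append(f"script path is not absolute: {script!r}")
    if entry.get("disabled"):
        problems.append("server is disabled")
    return problems


good = '{"mcpServers": {"vision-reader": {"command": "python", "args": ["C:/tools/vision_server.py"]}}}'
relative = '{"mcpServers": {"vision-reader": {"command": "python", "args": ["vision_server.py"]}}}'
# Raw string: the single backslashes below are what a careless paste looks like.
unescaped = r'{"mcpServers": {"vision-reader": {"command": "python", "args": ["C:\path\vision_server.py"]}}}'

results = [check_mcp_config(t) for t in (good, relative, unescaped)]
```

The first config passes cleanly, the second is flagged for its relative path, and the third never parses at all.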
Step 5 — Put it together
With both pieces in place, the end-to-end workflow is two commands and a prompt.
Extract once:
python extract_doc_images.py ./docs
Then ask Kiro in natural language:
Read all images in ./docs/extracted_images, recursively, and summarize the
architecture section by section.
read_all_images walks the extracted tree (its folder names carry the section
titles), returns each diagram as ImageContent, and Kiro describes what it
actually sees — boxes, arrows, labels, IP ranges, the lot. For a single loose
diagram you don't even need step 1:
Read docs/diagrams/system-overview.png and explain the data flow.
Why this approach is nice
- No API key, no per-image cost. Nothing leaves your machine except the image bytes handed to the host model you're already using.
- Higher fidelity. Kiro sees the real image instead of a second-hand text description.
- Unlocks documents, not just files. The extractor reaches diagrams that were previously invisible inside exported design docs.
- Section-aware. The folder hierarchy preserves which diagram belongs to which part of the document, so summaries stay organized.
- Tiny and dependency-light. The extractor is stdlib-only; the server needs just mcp and Pillow.
Gotchas
- Path scope. Kiro's built-in file tools are sandboxed to the workspace, but an MCP server runs as its own process and can read paths you give it. Point it only at directories you trust.
- Sensitive diagrams. Extracted images (and their section-named folders) can contain internal detail. Keep extracted_images/ out of version control; the included .gitignore does this for you.
- Untrusted images. Treat image contents as untrusted input. A diagram could contain text crafted to look like instructions, so don't act on text inside an image as if it were a command.
- Payload limits. Very large or very dense images may need a lower
MAX_DIMENSION. Tune it for your diagrams.
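For the path-scope gotcha, one possible guard is to resolve every requested path and reject anything that lands outside a set of allowed roots. This is a sketch, not part of the server shown earlier; is_within and the allowed list are assumed names, and the temporary directory only exists to make the demo self-contained.

```python
import tempfile
from pathlib import Path


def is_within(path: Path, roots: list[Path]) -> bool:
    """True if path, after resolving '..' and symlinks, sits under a root."""
    resolved = path.resolve()
    for root in roots:
        try:
            resolved.relative_to(root.resolve())
            return True
        except ValueError:
            continue  # not under this root, try the next one
    return False


with tempfile.TemporaryDirectory() as tmp:
    allowed = [Path(tmp) / "docs"]
    allowed[0].mkdir()
    inside = is_within(allowed[0] / "2. Solution" / "net.png", allowed)
    escaped = is_within(allowed[0] / ".." / "secret.png", allowed)
```

Resolving before comparing is the important part: a plain string-prefix check would accept the "../secret.png" path.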
Extending it
A few easy additions:
- More document formats in extract_doc_images.py (e.g. .docx, .pptx), which are ZIP archives with images under word/media/ or ppt/media/.
- A read_pdf_page tool that rasterizes a PDF page to an image.
- A whitelist of allowed root directories for safety.
- Caching by file hash so repeated reads are instant.
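The first of these is mostly a matter of listing ZIP entries. A minimal sketch with the standard zipfile module follows; office_media is a hypothetical name, and the in-memory archive at the bottom is a stand-in for a real .docx.

```python
import io
import zipfile
from pathlib import PurePosixPath

MEDIA_PREFIXES = ("word/media/", "ppt/media/")


def office_media(archive_bytes: bytes) -> dict[str, bytes]:
    """Map each embedded media file name to its bytes."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.startswith(MEDIA_PREFIXES):
                continue
            # Keep only the base name so an entry cannot point outside
            # whatever output folder the caller writes to.
            out[PurePosixPath(info.filename).name] = zf.read(info)
    return out


# Build a tiny stand-in archive: one XML part and one "image".
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")

media = office_media(buf.getvalue())
```

Unlike the MHTML path, this gives no section context on its own; mapping images back to headings would mean reading the document XML as well.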
That's the whole toolkit: an MHTML extractor to free diagrams from documents,
plus MCP's ImageContent and a model that can already see. Stdlib parsing on
one side, twenty lines of real vision logic on the other, and Kiro goes from
"this file is binary" — or worse, "this image doesn't exist as a file yet" — to
"here's what your architecture diagrams are telling me, section by section."