wechat-to-md
Converts WeChat Official Account articles to clean Markdown with locally downloaded images, supporting single and batch operations via MCP tools.
wechat-article-for-ai
English
A modular Python tool that converts WeChat Official Account (微信公众号) articles into clean Markdown files with locally downloaded images. Designed for both human use (CLI) and AI agent integration (MCP server + SKILL.md).
Features
- Anti-detection scraping: Uses Camoufox (stealth Firefox) to bypass WeChat's bot detection
- Smart page loading: `networkidle` wait instead of hardcoded sleep
- Retry logic: 3× exponential backoff for page fetching, 3× linear backoff for image downloads
- CAPTCHA detection: Explicit detection with actionable error messages
- Batch processing: Multiple URLs via args or file input
- Image localization: Concurrent async downloads with Content-Type based extension inference
- Code block preservation: Language detection, CSS counter garbage filtering
- Media extraction: Handles WeChat's `<mpvoice>` audio and `<mpvideo>` video elements
- YAML frontmatter: Structured metadata (title, author, date, source)
- MCP server: Expose as tools for any MCP-compatible AI client
- SKILL.md: Ready for Claude Code skill integration
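The two retry policies in the list above differ only in how the wait grows between attempts. The following is a minimal sketch of that difference, not the project's actual code: the names `backoff_delays` and `retry` are hypothetical, the one-second base delay is an assumption, and "3×" is read here as three attempts in total.

```python
import time


def backoff_delays(attempts, base=1.0, exponential=True):
    """Return the waits used between attempts (one fewer than the attempt count)."""
    if exponential:
        # Each wait doubles: base, 2*base, 4*base, ...
        return [base * (2 ** i) for i in range(attempts - 1)]
    # Each wait grows by a fixed step: base, 2*base, 3*base, ...
    return [base * (i + 1) for i in range(attempts - 1)]


def retry(func, attempts=3, base=1.0, exponential=True, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base, exponential)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

With only three attempts and the same base, the two schedules coincide (two waits each); they start to diverge from the fourth attempt onward, which is why the function takes the attempt count as a parameter.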
Installation
git clone https://github.com/bzd6661/wechat-article-for-ai.git
cd wechat-article-for-ai
pip install -r requirements.txt
The Camoufox browser is downloaded automatically on first run.
Usage
CLI — Single Article
python main.py "https://mp.weixin.qq.com/s/ARTICLE_ID"
CLI — Batch from File
python main.py -f urls.txt -o ./output -v
CLI Options
| Flag | Description |
|---|---|
| `urls` | One or more WeChat article URLs |
| `-f, --file FILE` | Text file with URLs (one per line, `#` for comments) |
| `-o, --output DIR` | Output directory (default: `./output`) |
| `-c, --concurrency N` | Max concurrent image downloads (default: 5) |
| `--no-images` | Skip image download, keep remote URLs |
| `--no-headless` | Show browser window (for solving CAPTCHAs) |
| `--force` | Overwrite existing output |
| `--no-frontmatter` | Use blockquote metadata instead of YAML frontmatter |
| `-v, --verbose` | Enable debug logging |
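The `-f` option expects one URL per line, with `#` marking comments. A rough sketch of how such a file could be read is shown below. The function names are hypothetical, and it assumes that only lines starting with `#` count as comments, since a `#` inside a URL may be a legitimate fragment.

```python
from pathlib import Path


def parse_url_lines(lines):
    """Keep non-empty lines that are not comments, with whitespace trimmed."""
    urls = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        urls.append(line)
    return urls


def read_url_file(path):
    """Read a UTF-8 text file and return the URLs it lists."""
    text = Path(path).read_text(encoding="utf-8")
    return parse_url_lines(text.splitlines())
```

A file prepared this way can then be passed to the tool with `python main.py -f urls.txt`.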
MCP Server
Run as an MCP server for AI tool integration:
python mcp_server.py
Tools exposed:
- `convert_article`: Convert a single WeChat article to Markdown
- `batch_convert`: Convert multiple articles in one call
MCP client configuration (e.g. claude_desktop_config.json):
{
  "mcpServers": {
    "wechat-to-md": {
      "command": "python",
      "args": ["mcp_server.py"],
      "cwd": "/path/to/wechat-article-for-ai"
    }
  }
}
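If the dependencies live in a virtual environment, a bare `python` command may resolve to a different interpreter when the client launches the server. One possible workaround, sketched here under that assumption, is to generate the entry with the absolute path of the interpreter that has the requirements installed. The helper name `build_mcp_entry` is hypothetical; the keys mirror the configuration shown above.

```python
import json
import sys
from pathlib import Path


def build_mcp_entry(repo_dir, python=sys.executable):
    """Build the mcpServers entry with an absolute working directory."""
    repo = Path(repo_dir).expanduser().resolve()
    return {
        "mcpServers": {
            "wechat-to-md": {
                "command": python,  # interpreter that has the requirements installed
                "args": ["mcp_server.py"],
                "cwd": str(repo),
            }
        }
    }


# Print an entry for the current directory, ready to merge into the client config.
print(json.dumps(build_mcp_entry("."), indent=2))
```

Run it from the cloned repository with the environment activated, then merge the printed object into the client's configuration file.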
Output Structure
output/
  <article-title>/
    <article-title>.md
    images/
      img_001.png
      img_002.jpg
      ...
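The layout above implies two small steps: turning an article title into a safe folder name, and choosing each image's extension from the response's Content-Type header. The sketch below illustrates one way to do both. All names, the character set being replaced, the MIME table, and the `.jpg` fallback are assumptions for illustration rather than the project's implementation.

```python
import re
from pathlib import Path

# Assumed mapping from MIME type to file extension.
CONTENT_TYPE_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
    "image/svg+xml": ".svg",
}


def sanitize_title(title, max_len=80):
    """Replace characters that are unsafe in file names on common platforms."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip(" .")
    return cleaned[:max_len] or "untitled"


def ext_from_content_type(content_type, default=".jpg"):
    """Pick an extension from a Content-Type header, ignoring parameters."""
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_EXT.get(mime, default)


def image_path(output_dir, title, index, content_type):
    """Compose output/<title>/images/img_NNN.<ext> for the Nth image."""
    folder = Path(output_dir) / sanitize_title(title)
    return folder / "images" / f"img_{index:03d}{ext_from_content_type(content_type)}"
```

Inferring the extension from the header rather than the URL is useful when image URLs carry no file suffix, which is the case the feature list describes.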
Project Structure
wechat_to_md/
  __init__.py      # Package init, public API
  errors.py        # CaptchaError, NetworkError, ParseError
  utils.py         # Logging, filename sanitizer, timestamp, image ext inference
  scraper.py       # Camoufox + networkidle + retry with exponential backoff
  parser.py        # BeautifulSoup: metadata, code blocks, media, noise removal
  converter.py     # markdownify + YAML frontmatter + image URL replacement
  downloader.py    # httpx async + retry per image + Content-Type inference
  cli.py           # argparse CLI with batch support
  mcp_server.py    # FastMCP server with convert_article / batch_convert
main.py            # CLI entry point
mcp_server.py      # MCP server entry point
SKILL.md           # AI skill definition
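`converter.py` is described as emitting YAML frontmatter, with blockquote metadata as the alternative under `--no-frontmatter`. As an illustration only, the two styles might be rendered as follows. The function name `render_metadata` and the exact blockquote wording are assumptions; the field list (title, author, date, source) comes from the feature description.

```python
import json


def render_metadata(meta, frontmatter=True):
    """Render article metadata as YAML frontmatter or as a Markdown blockquote."""
    keys = ("title", "author", "date", "source")
    fields = [(k, meta[k]) for k in keys if meta.get(k)]
    if frontmatter:
        # A JSON string literal is also a valid YAML double-quoted scalar,
        # so json.dumps handles quotes and backslashes in titles safely.
        body = "\n".join(
            f"{k}: {json.dumps(v, ensure_ascii=False)}" for k, v in fields
        )
        return f"---\n{body}\n---\n"
    # Assumed blockquote layout: one bold label per line.
    return "\n".join(f"> **{k.capitalize()}:** {v}" for k, v in fields) + "\n"
```

Quoting every value avoids surprises with titles that contain colons or quotation marks, which would otherwise break a hand-written `key: value` line.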
Troubleshooting
| Problem | Solution |
|---|---|
| CAPTCHA / verification page | Run with --no-headless to solve manually |
| Empty content | WeChat may be rate-limiting; wait and retry |
| Image download failures | Failed images keep remote URLs; re-run with --force |
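The last row depends on a fallback: an image that cannot be downloaded keeps its remote URL in the Markdown. A minimal sketch of that behavior combined with the `-c` concurrency limit is given below, using only `asyncio`. The function `localize_images` and the `fetch` callback are hypothetical stand-ins for the real downloader.

```python
import asyncio


async def localize_images(urls, fetch, concurrency=5):
    """Map each remote URL to a local path, or to itself if the download fails."""
    semaphore = asyncio.Semaphore(concurrency)  # caps simultaneous downloads

    async def one(index, url):
        async with semaphore:
            try:
                return url, await fetch(index, url)
            except Exception:
                return url, url  # keep the remote URL on failure

    tasks = (one(i, u) for i, u in enumerate(urls, start=1))
    pairs = await asyncio.gather(*tasks)
    return dict(pairs)
```

Because failures map a URL to itself, the resulting dictionary can be applied to the Markdown unconditionally: successful downloads are rewritten to local paths and failed ones are left untouched.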
License
MIT