wechat-to-md

wechat-to-md

Converts WeChat Official Account articles to clean Markdown with locally downloaded images, supporting single and batch operations via MCP tools.

Category
Visit Server

README

wechat-article-for-ai

English | 中文


English

A modular Python tool that converts WeChat Official Account (微信公众号) articles into clean Markdown files with locally downloaded images. Designed for both human use (CLI) and AI agent integration (MCP server + SKILL.md).

Features

  • Anti-detection scraping — Uses Camoufox (stealth Firefox) to bypass WeChat's bot detection
  • Smart page loadingnetworkidle wait instead of hardcoded sleep
  • Retry logic — 3× exponential backoff for page fetching, 3× linear backoff for image downloads
  • CAPTCHA detection — Explicit detection with actionable error messages
  • Batch processing — Multiple URLs via args or file input
  • Image localization — Concurrent async downloads with Content-Type based extension inference
  • Code block preservation — Language detection, CSS counter garbage filtering
  • Media extraction — Handles WeChat's <mpvoice> audio and <mpvideo> video elements
  • YAML frontmatter — Structured metadata (title, author, date, source)
  • MCP server — Expose as tools for any MCP-compatible AI client
  • SKILL.md — Ready for Claude Code skill integration

Installation

git clone https://github.com/bzd6661/wechat-article-for-ai.git
cd wechat-article-for-ai
pip install -r requirements.txt

Camoufox browser will be auto-downloaded on first run.

Usage

CLI — Single Article

python main.py "https://mp.weixin.qq.com/s/ARTICLE_ID"

CLI — Batch from File

python main.py -f urls.txt -o ./output -v

CLI Options

Flag Description
urls One or more WeChat article URLs
-f, --file FILE Text file with URLs (one per line, # for comments)
-o, --output DIR Output directory (default: ./output)
-c, --concurrency N Max concurrent image downloads (default: 5)
--no-images Skip image download, keep remote URLs
--no-headless Show browser window (for solving CAPTCHAs)
--force Overwrite existing output
--no-frontmatter Use blockquote metadata instead of YAML frontmatter
-v, --verbose Enable debug logging

MCP Server

Run as an MCP server for AI tool integration:

python mcp_server.py

Tools exposed:

  • convert_article — Convert a single WeChat article to Markdown
  • batch_convert — Convert multiple articles in one call

MCP client configuration (e.g. claude_desktop_config.json):

{
  "mcpServers": {
    "wechat-to-md": {
      "command": "python",
      "args": ["mcp_server.py"],
      "cwd": "/path/to/wechat-article-for-ai"
    }
  }
}

Output Structure

output/
  <article-title>/
    <article-title>.md
    images/
      img_001.png
      img_002.jpg
      ...

Project Structure

wechat_to_md/
  __init__.py        # Package init, public API
  errors.py          # CaptchaError, NetworkError, ParseError
  utils.py           # Logging, filename sanitizer, timestamp, image ext inference
  scraper.py         # Camoufox + networkidle + retry with exponential backoff
  parser.py          # BeautifulSoup: metadata, code blocks, media, noise removal
  converter.py       # markdownify + YAML frontmatter + image URL replacement
  downloader.py      # httpx async + retry per image + Content-Type inference
  cli.py             # argparse CLI with batch support
  mcp_server.py      # FastMCP server with convert_article / batch_convert
main.py              # CLI entry point
mcp_server.py        # MCP server entry point
SKILL.md             # AI skill definition

Troubleshooting

Problem Solution
CAPTCHA / verification page Run with --no-headless to solve manually
Empty content WeChat may be rate-limiting; wait and retry
Image download failures Failed images keep remote URLs; re-run with --force

License

MIT


中文

一个模块化的 Python 工具,将微信公众号文章转换为干净的 Markdown 文件并下载图片到本地。同时支持人工使用(CLI)和 AI 智能体集成(MCP 服务器 + SKILL.md)。

功能特点

  • 反检测抓取 — 使用 Camoufox(隐身 Firefox)绕过微信的反爬机制
  • 智能页面等待 — 使用 networkidle 替代硬编码的 sleep
  • 重试机制 — 页面加载 3 次指数退避重试,图片下载 3 次线性退避重试
  • 验证码检测 — 明确识别验证码页面并给出可操作的错误提示
  • 批量处理 — 支持多个 URL 参数或从文件读取
  • 图片本地化 — 异步并发下载,基于 Content-Type 推断图片格式
  • 代码块保留 — 自动检测编程语言,过滤 CSS 计数器垃圾文本
  • 媒体提取 — 处理微信的 <mpvoice> 音频和 <mpvideo> 视频元素
  • YAML 元数据 — 结构化的 frontmatter(标题、作者、日期、来源)
  • MCP 服务器 — 暴露为工具,供任何 MCP 兼容的 AI 客户端调用
  • SKILL.md — 可直接作为 Claude Code 技能使用

安装

git clone https://github.com/bzd6661/wechat-article-for-ai.git
cd wechat-article-for-ai
pip install -r requirements.txt

Camoufox 浏览器会在首次运行时自动下载。

使用方法

CLI — 单篇文章

python main.py "https://mp.weixin.qq.com/s/文章ID"

CLI — 批量转换

python main.py -f urls.txt -o ./output -v

CLI 参数

参数 说明
urls 一个或多个微信文章链接
-f, --file 文件 包含 URL 的文本文件(每行一个,# 为注释)
-o, --output 目录 输出目录(默认:./output
-c, --concurrency N 图片下载最大并发数(默认:5)
--no-images 跳过图片下载,保留远程链接
--no-headless 显示浏览器窗口(用于手动解决验证码)
--force 覆盖已有的输出目录
--no-frontmatter 使用引用块格式的元数据,而非 YAML frontmatter
-v, --verbose 启用调试日志

MCP 服务器

作为 MCP 服务器运行,供 AI 工具集成:

python mcp_server.py

暴露的工具:

  • convert_article — 转换单篇微信文章为 Markdown
  • batch_convert — 批量转换多篇文章

MCP 客户端配置(如 claude_desktop_config.json):

{
  "mcpServers": {
    "wechat-to-md": {
      "command": "python",
      "args": ["mcp_server.py"],
      "cwd": "/path/to/wechat-article-for-ai"
    }
  }
}

输出结构

output/
  <文章标题>/
    <文章标题>.md
    images/
      img_001.png
      img_002.jpg
      ...

项目结构

wechat_to_md/
  __init__.py        # 包初始化,公共 API
  errors.py          # CaptchaError, NetworkError, ParseError
  utils.py           # 日志、文件名清理、时间戳、图片格式推断
  scraper.py         # Camoufox + networkidle + 指数退避重试
  parser.py          # BeautifulSoup:元数据、代码块、媒体、噪音移除
  converter.py       # markdownify + YAML frontmatter + 图片 URL 替换
  downloader.py      # httpx 异步 + 逐图重试 + Content-Type 推断
  cli.py             # argparse CLI,支持批量处理
  mcp_server.py      # FastMCP 服务器
main.py              # CLI 入口
mcp_server.py        # MCP 服务器入口
SKILL.md             # AI 技能定义文件

常见问题

问题 解决方法
出现验证码 / 环境异常 使用 --no-headless 手动解决验证码
内容为空 微信可能在限流,等几分钟再试
图片下载失败 失败的图片会保留远程链接,用 --force 重新运行

许可证

MIT

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured