CHM to Markdown Converter

DTDucas

A Python utility that converts Compiled HTML Help (CHM) files to Markdown. It extracts the HTML files from a CHM archive and converts them to well-formatted Markdown files, which makes technical documentation easier to read and to keep under version control.

Features

  • Extracts CHM files using 7-Zip
  • Converts HTML content to clean Markdown format
  • Special handling for code snippets with language-specific syntax highlighting
  • Preserves and fixes tables
  • Updates internal links to maintain document references
  • Processes files asynchronously for better performance
  • Batch processing with progress reporting
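
To illustrate the internal-link feature, here is a minimal sketch of one way to rewrite links in converted Markdown so they point at .md files instead of the original .htm/.html pages. The function name rewrite_links and the regular expression are assumptions made for this example, not code taken from chm_to_markdown.py.

```python
import re

# Matches a Markdown link target such as "](page.htm)" or "](dir/page.html#anchor)".
# Absolute http(s) URLs are skipped by the negative lookahead.
_LINK = re.compile(r"\]\((?!https?://)([^)#\s]+?)\.html?(#[^)\s]*)?\)")


def rewrite_links(markdown: str) -> str:
    """Point relative .htm/.html link targets at the matching .md file."""

    def replace(match):
        path = match.group(1)
        anchor = match.group(2) or ""  # keep "#section" fragments if present
        return f"]({path}.md{anchor})"

    return _LINK.sub(replace, markdown)
```

External links are left untouched because they still refer to real HTML pages.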

Requirements

  • Python 3.7+
  • 7-Zip installed in the default location (C:\Program Files\7-Zip\7z.exe)
  • The following Python packages:
    • beautifulsoup4
    • html2text
    • aiofiles

Installation

  1. Clone or download this repository
  2. Install required Python packages:
pip install -r requirements.txt

Or install them directly:

pip install beautifulsoup4 html2text aiofiles
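
After installing, a quick check like the sketch below can confirm that the three packages are importable. Note that beautifulsoup4 is imported as bs4. The helper name missing_packages is hypothetical and is not part of the script.

```python
from importlib.util import find_spec

# pip package name -> module name used in "import ..."
REQUIRED = {
    "beautifulsoup4": "bs4",
    "html2text": "html2text",
    "aiofiles": "aiofiles",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]
```

An empty list means everything is installed. Otherwise the returned names can be passed to pip install.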

Usage

  1. Edit the configuration variables in the main() function of chm_to_markdown.py:
input_folder = r"C:\Path\To\Extracted\Files"  # Temporary folder for extracting CHM
output_folder = r"C:\Path\To\Output\Markdown"  # Where Markdown files will be saved
chm_file_path = r"C:\Path\To\Your\File.chm"    # Your CHM file path
  2. Run the script:
python chm_to_markdown.py
  3. The script will:
    • Clear the input and output folders
    • Extract CHM files to the input folder
    • Convert HTML files to Markdown
    • Save the Markdown files to the output folder
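
The first two steps above can be sketched as follows, assuming the script clears folders by deleting and recreating them and calls the 7-Zip command line to unpack the CHM. The helper names reset_folder and build_extract_command are hypothetical. The 7-Zip arguments (x, -o, -y) are standard 7z switches.

```python
import shutil
from pathlib import Path

# Default location named in the Requirements section.
SEVEN_ZIP = r"C:\Program Files\7-Zip\7z.exe"


def reset_folder(folder):
    """Delete a folder with everything in it, then recreate it empty."""
    path = Path(folder)
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)


def build_extract_command(chm_file_path, input_folder, seven_zip=SEVEN_ZIP):
    """Return the 7-Zip argument list that unpacks a CHM into input_folder."""
    # "x" extracts with full paths, "-o<dir>" sets the target folder
    # (no space after -o), and "-y" answers every prompt with yes.
    return [seven_zip, "x", str(chm_file_path), f"-o{input_folder}", "-y"]
```

The returned list can be passed to subprocess.run(..., check=True). Because clearing a folder deletes its contents, point input_folder and output_folder at directories that hold nothing you want to keep.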

Performance Tuning

You can adjust the following parameters in the process_folder_async() call to optimize performance for your system:

  • max_workers: Number of worker threads for CPU-bound operations
  • semaphore_limit: Maximum concurrent file I/O operations
  • batch_size: Number of files to process in each batch
await process_folder_async(
    input_folder, output_folder, max_workers=8, semaphore_limit=20, batch_size=50
)
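
To show how the three parameters can interact, here is a self-contained sketch of the general pattern: a thread pool for conversion work, a semaphore around I/O, and sequential batches. It is an illustration under assumptions, not the body of process_folder_async(). The names process_items and convert are hypothetical, and the conversion is a stand-in.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def convert(text: str) -> str:
    """Stand-in for the CPU-bound HTML-to-Markdown conversion."""
    return text.upper()


async def process_items(items, max_workers=8, semaphore_limit=20, batch_size=50):
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(semaphore_limit)  # caps concurrent I/O
    results = []

    async def handle(item):
        async with semaphore:
            await asyncio.sleep(0)  # stand-in for an async file read
        # Conversion runs in the pool, at most max_workers at a time.
        return await loop.run_in_executor(pool, convert, item)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Only one batch of items is in flight at any moment.
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            results.extend(await asyncio.gather(*(handle(i) for i in batch)))
            done = min(start + batch_size, len(items))
            print(f"processed {done}/{len(items)}")
    return results
```

In this pattern, batch_size bounds how many files are held in memory at once, semaphore_limit bounds open file handles, and max_workers bounds parallel conversions.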

Customization

The script provides several customization options for content conversion:

Removing Unwanted Elements

You can customize which HTML elements to remove by editing these lists:

tags_to_remove = ["iframe", "object", "script", "br", "img"]
classes_to_remove = ["collapsibleAreaRegion", "collapsibleRegionTitle", ...]
ids_to_remove = ["PageFooter"]
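
As a rough sketch of how such lists might be applied with BeautifulSoup, the function below removes every matching element before conversion. The function name strip_unwanted is an assumption, and the class list here contains only the two names shown above because the rest of the list is elided in this README.

```python
from bs4 import BeautifulSoup

tags_to_remove = ["iframe", "object", "script", "br", "img"]
classes_to_remove = ["collapsibleAreaRegion", "collapsibleRegionTitle"]
ids_to_remove = ["PageFooter"]


def _remove_all(soup, **criteria):
    """Decompose matching elements one at a time until none are left."""
    element = soup.find(**criteria)
    while element is not None:
        element.decompose()  # removes the element and its children
        element = soup.find(**criteria)


def strip_unwanted(html: str) -> str:
    """Return the HTML with unwanted tags, classes, and ids removed."""
    soup = BeautifulSoup(html, "html.parser")
    for name in tags_to_remove:
        _remove_all(soup, name=name)
    for css_class in classes_to_remove:
        _remove_all(soup, class_=css_class)
    for element_id in ids_to_remove:
        _remove_all(soup, id=element_id)
    return str(soup)
```

Searching again after each removal avoids touching an element that was already destroyed together with its parent.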

Code Snippets

The script handles code snippets with language-specific formatting. You can customize the language mapping:

id_to_lang = {
    "IDAB_code_Div1": "csharp",
    "IDAB_code_Div2": "vb",
    "IDAB_code_Div3": "cpp",
    "IDAB_code_Div4": "fsharp",
}
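
The sketch below shows one way such a mapping could be used: look up the language for a snippet's container id and wrap the code in a fenced Markdown block. The function name to_fenced_block and the fallback to an unlabelled fence for unknown ids are assumptions for this example.

```python
id_to_lang = {
    "IDAB_code_Div1": "csharp",
    "IDAB_code_Div2": "vb",
    "IDAB_code_Div3": "cpp",
    "IDAB_code_Div4": "fsharp",
}

FENCE = "`" * 3  # three backticks, built here to keep this example readable


def to_fenced_block(div_id: str, code: str, mapping=id_to_lang) -> str:
    """Wrap code in a Markdown fence labelled with the mapped language."""
    lang = mapping.get(div_id, "")  # unknown ids get an unlabelled fence
    return f"{FENCE}{lang}\n{code.strip()}\n{FENCE}"
```

Adding a language only requires a new entry in the dictionary, for example mapping another container id to "python".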

Troubleshooting

  • Missing modules error: Make sure you've installed all required packages and your Python environment is correctly configured.
  • 7-Zip not found: Check that 7-Zip is installed in the default location or update the path in the script.
  • Permission errors: Run your terminal or command prompt with administrator privileges.
  • Memory issues with large CHM files: Try reducing batch_size and max_workers so that fewer files are held in memory at the same time.
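
For the "7-Zip not found" case, a lookup like the following sketch can help you find the path to put in the script. It checks the default location first and then searches the PATH for common 7-Zip executable names. The function name find_7zip is hypothetical and not part of chm_to_markdown.py.

```python
import shutil
from pathlib import Path

# Default location named in the Requirements section.
DEFAULT_7ZIP = Path(r"C:\Program Files\7-Zip\7z.exe")


def find_7zip(candidates=(DEFAULT_7ZIP,)):
    """Return a path to a 7-Zip executable, or None if none is found."""
    # 1. Explicit candidate paths, checked in order.
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    # 2. Executable names that 7-Zip builds commonly install on the PATH.
    for name in ("7z", "7za", "7zz"):
        found = shutil.which(name)
        if found:
            return found
    return None
```

If this returns None, 7-Zip is either not installed or installed somewhere unusual, and its full path has to be supplied by hand.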

License

This project is open source and available under the MIT License.
