kordoc

An MCP server that parses South Korean document formats like HWP, HWPX, and PDF into Markdown. It features specialized table reconstruction and security-hardened extraction optimized for administrative and public institution files.

"I'll parse them all": parse any Korean document to Markdown.

HWP, HWPX, PDF: if it is a South Korean document, kordoc parses all of it.

Why kordoc?

South Korea's government runs on HWP — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of .hwp files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.

kordoc was born from that document hell. Built by a Korean civil servant who spent 7 years buried under HWP files at a district office. One day he snapped — and decided to parse them all. Its parsers have been battle-tested across 5 real government projects, processing school curriculum plans, facility inspection reports, legal annexes, and municipal newsletters. If a Korean public servant wrote it, kordoc can parse it.


Features

  • HWP 5.x Binary Parsing — OLE2 container + record stream + UTF-16LE. No Hancom Office needed.
  • HWPX ZIP Parsing — OPF manifest resolution, multi-section, nested tables.
  • PDF Text Extraction — Y-coordinate line grouping, table reconstruction, image PDF detection.
  • 2-Pass Table Builder — Correct colSpan/rowSpan via grid algorithm. No broken tables.
  • Broken ZIP Recovery — Corrupted HWPX? Scans raw Local File Headers.
  • 3 Interfaces — npm library, CLI tool, and MCP server (Claude/Cursor).
  • Cross-Platform — Pure JavaScript. Runs on Linux, macOS, Windows.
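
The "Y-coordinate line grouping" idea from the PDF feature can be illustrated with a minimal sketch. It assumes each text fragment carries an x/y position; the `TextFragment` type, the `groupIntoLines` name, and the 2-unit tolerance are illustrative assumptions, not kordoc's actual internals.

```typescript
// Hypothetical shape for a positioned text fragment (illustrative only).
interface TextFragment {
  text: string
  x: number // left edge in page units
  y: number // baseline position in page units
}

// Group fragments whose baselines lie within `tolerance` of the first
// fragment of a line, then read each line left to right.
// PDF y coordinates grow upward, so larger y values come first.
function groupIntoLines(fragments: TextFragment[], tolerance = 2): string[] {
  const sorted = [...fragments].sort((a, b) => b.y - a.y)
  const lines: { y: number; items: TextFragment[] }[] = []

  for (const frag of sorted) {
    const current = lines[lines.length - 1]
    if (current && Math.abs(current.y - frag.y) <= tolerance) {
      current.items.push(frag) // same visual line
    } else {
      lines.push({ y: frag.y, items: [frag] }) // start a new line
    }
  }

  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((f) => f.text)
      .join(" ")
  )
}
```

A real extractor would also need to handle rotated text and variable font sizes; the fixed tolerance here is only a starting point.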

Supported Formats

| Format | Engine | Features |
| --- | --- | --- |
| HWPX (Hancom 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
| HWP 5.x (Hancom legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
| PDF | pdfjs-dist | Line grouping, table detection, image PDF warning |

Installation

npm install kordoc

# PDF support requires pdfjs-dist (optional peer dependency)
npm install pdfjs-dist

pdfjs-dist is an optional peer dependency. Not needed for HWP/HWPX parsing.

Usage

As a Library

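import { parse } from "kordoc"
import { readFileSync } from "fs"

// readFileSync returns a Node Buffer, which can be a view into a larger shared
// ArrayBuffer, so slice out exactly the file's bytes before calling parse().
const file = readFileSync("document.hwpx")
const buffer = file.buffer.slice(file.byteOffset, file.byteOffset + file.byteLength)
const result = await parse(buffer)

if (result.success) {
  console.log(result.markdown)
} else {
  console.error(result.error)
}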

Format-Specific

import { parseHwpx, parseHwp, parsePdf } from "kordoc"

const hwpxResult = await parseHwpx(buffer)   // HWPX
const hwpResult  = await parseHwp(buffer)    // HWP 5.x
const pdfResult  = await parsePdf(buffer)    // PDF

Format Detection

import { detectFormat } from "kordoc"

detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
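
For readers curious what magic-byte detection involves, here is a minimal sketch based on the well-known file signatures of the three container types. The `sniffFormat` helper is hypothetical and is not kordoc's `detectFormat`; in particular, it treats any ZIP as HWPX and any OLE2 file as HWP, which a real detector would need to verify further.

```typescript
type SniffedFormat = "hwpx" | "hwp" | "pdf" | "unknown"

// Well-known file signatures.
const OLE2_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // Compound File Binary (HWP 5.x container)
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04" local file header (HWPX container)
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46] // "%PDF"

function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b)
}

// Hypothetical helper: classify a file by its leading bytes only.
// Caveat: a ZIP is not necessarily HWPX, and an OLE2 file is not
// necessarily HWP; this only checks the outer container signature.
function sniffFormat(bytes: Uint8Array): SniffedFormat {
  if (startsWith(bytes, OLE2_MAGIC)) return "hwp"
  if (startsWith(bytes, ZIP_MAGIC)) return "hwpx"
  if (startsWith(bytes, PDF_MAGIC)) return "pdf"
  return "unknown"
}
```

With an `ArrayBuffer` in hand, the call would look like `sniffFormat(new Uint8Array(arrayBuffer))`.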

As a CLI

npx kordoc document.hwpx                    # stdout
npx kordoc document.hwp -o output.md        # save to file
npx kordoc *.pdf -d ./converted/            # batch convert
npx kordoc report.hwpx --format json        # JSON with metadata

As an MCP Server

Works with Claude Desktop, Cursor, Windsurf, and any MCP-compatible client.

{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}

Tools exposed:

| Tool | Description |
| --- | --- |
| parse_document | Parse HWP/HWPX/PDF file → Markdown |
| detect_format | Detect file format via magic bytes |

API Reference

parse(buffer: ArrayBuffer): Promise<ParseResult>

Auto-detects format and converts to Markdown.

interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean     // scanned PDF detection
  pageCount?: number         // PDF only
  error?: string
}
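
One way to consume this result shape is sketched below. The `describeResult` helper is hypothetical, and the interface is copied locally only so the snippet stands alone. Because the README does not say whether a scanned PDF is reported with `success: true` or `false`, the sketch checks `isImageBased` first.

```typescript
// Local copy of the ParseResult shape documented above, so the sketch
// type-checks on its own.
interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean
  pageCount?: number
  error?: string
}

// Hypothetical helper: turn a result into a one-line status for logs.
function describeResult(result: ParseResult): string {
  if (result.isImageBased) {
    return `${result.fileType}: image-based document, OCR is probably required`
  }
  if (!result.success) {
    return `failed (${result.fileType}): ${result.error ?? "no error message"}`
  }
  const pages = result.pageCount !== undefined ? `, ${result.pageCount} pages` : ""
  const length = result.markdown?.length ?? 0
  return `${result.fileType}: ${length} chars of Markdown${pages}`
}
```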

Types

import type { ParseResult, ParseSuccess, ParseFailure, FileType } from "kordoc"

Internal types (IRBlock, IRTable, IRCell, CellContext) and utilities (KordocError, sanitizeError, isPathTraversal, buildTable, blocksToMarkdown) are not part of the public API.

Requirements

  • Node.js >= 18
  • pdfjs-dist >= 4.0.0 — Optional. Only needed for PDF. HWP/HWPX work without it.

Security

Production-grade security hardening:

  • ZIP bomb protection — Entry count validation, 100MB decompression limit, 500 entry cap

    Known limitation: Pre-check reads declared sizes from ZIP Central Directory, which an attacker can falsify. The primary defense is per-file cumulative size tracking during actual decompression. For fully untrusted input where streaming decompression is required, consider wrapping kordoc behind a size-limited sandbox.

  • XXE/Billion Laughs prevention — Internal DTD subsets fully stripped from HWPX XML
  • Decompression bomb guard — maxOutputLength on HWP5 zlib streams, cumulative 100MB limit across sections
  • PDF resource limits — MAX_PAGES=5,000, cumulative text size 100MB cap, doc.destroy() cleanup
  • HWP5 record cap — Max 500,000 records per section, prevents memory exhaustion from crafted files
  • Table dimension clamping — rows/cols read from HWP5 binary clamped to MAX_ROWS/MAX_COLS before allocation
  • colSpan/rowSpan clamping — Crafted merge values clamped to grid bounds (MAX_COLS=200, MAX_ROWS=10,000)
  • Path traversal guard — Backslash normalization, .., absolute paths, Windows drive letters all rejected
  • MCP error sanitization — Allowlist-based error filtering, unknown errors return generic message
  • MCP path restriction — Only .hwp, .hwpx, .pdf extensions allowed, symlink resolution
  • File size limit — 500MB max in MCP server and CLI
  • HWP5 section limit — Max 100 sections in both primary and fallback paths
  • HWP5 control char fix — Character code 10 (footnote/endnote) now correctly handled
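
To make the path traversal guard concrete, here is a minimal sketch of the checks the list describes (backslash normalization, `..` segments, absolute paths, Windows drive letters). The `looksLikeTraversal` name and the exact rules are assumptions for illustration; kordoc's internal `isPathTraversal` may behave differently.

```typescript
// Hypothetical check for archive entry names (illustrative only).
function looksLikeTraversal(entryName: string): boolean {
  const normalized = entryName.replace(/\\/g, "/") // backslash normalization
  if (normalized.startsWith("/")) return true // absolute POSIX-style path
  if (/^[A-Za-z]:/.test(normalized)) return true // Windows drive letter
  // Reject any ".." path segment, but allow names that merely contain dots.
  return normalized.split("/").some((segment) => segment === "..")
}
```

Checking whole segments rather than searching for the substring `..` avoids rejecting harmless names such as `file..name.xml`.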

How It Works

┌─────────────┐     Magic Bytes      ┌──────────────────┐
│  File Input  │ ──── Detection ────→ │  Format Router   │
└─────────────┘                       └────────┬─────────┘
                                               │
                    ┌──────────────────────────┼──────────────────────────┐
                    │                          │                          │
              ┌─────▼─────┐            ┌───────▼───────┐          ┌──────▼──────┐
              │   HWPX    │            │    HWP 5.x    │          │     PDF     │
              │  ZIP+XML  │            │  OLE2+Record  │          │  pdfjs-dist │
              └─────┬─────┘            └───────┬───────┘          └──────┬──────┘
                    │                          │                          │
                    │       ┌──────────────────┤                          │
                    │       │                  │                          │
              ┌─────▼───────▼─────┐            │                          │
              │  2-Pass Table     │            │                          │
              │  Builder (Grid)   │            │                          │
              └─────────┬─────────┘            │                          │
                        │                      │                          │
                  ┌─────▼──────────────────────▼──────────────────────────▼─────┐
                  │                      IRBlock[]                              │
                  │              (Intermediate Representation)                  │
                  └────────────────────────┬───────────────────────────────────┘
                                           │
                                    ┌──────▼──────┐
                                    │  Markdown   │
                                    │   Output    │
                                    └─────────────┘
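
The grid step in the diagram can be read as follows: a first pass assigns every cell a position while accounting for slots covered by earlier spans, and a second pass fills a rectangular grid. The sketch below is one plausible interpretation, not kordoc's actual `buildTable`; the `SpanCell` type and function names are hypothetical, and the bounds reuse the MAX_COLS/MAX_ROWS values quoted in the Security section.

```typescript
interface SpanCell {
  text: string
  colSpan?: number
  rowSpan?: number
}

const MAX_COLS = 200
const MAX_ROWS = 10_000

// Clamp a span to the range [1, max]; missing or non-finite values become 1.
function clampSpan(value: number | undefined, max: number): number {
  if (value === undefined || !Number.isFinite(value)) return 1
  return Math.min(Math.max(Math.floor(value), 1), max)
}

function buildGrid(rows: SpanCell[][]): string[][] {
  const rowCount = Math.min(rows.length, MAX_ROWS)
  const occupied = new Set<string>()
  const placed: { r: number; c: number; text: string }[] = []
  let colCount = 0

  // Pass 1: find each cell's anchor, skipping slots already covered by a
  // span from an earlier cell, and mark the slots this cell covers.
  for (let r = 0; r < rowCount; r++) {
    let c = 0
    for (const cell of rows[r] ?? []) {
      while (occupied.has(`${r},${c}`)) c++
      if (c >= MAX_COLS) break
      const colSpan = clampSpan(cell.colSpan, MAX_COLS - c)
      const rowSpan = clampSpan(cell.rowSpan, rowCount - r)
      for (let dr = 0; dr < rowSpan; dr++) {
        for (let dc = 0; dc < colSpan; dc++) occupied.add(`${r + dr},${c + dc}`)
      }
      placed.push({ r, c, text: cell.text })
      c += colSpan
      colCount = Math.max(colCount, c)
    }
  }

  // Pass 2: allocate a rectangular grid and write each text at its anchor.
  // Slots covered by a span stay empty, since Markdown has no merged cells.
  const grid = Array.from({ length: rowCount }, () => Array<string>(colCount).fill(""))
  for (const { r, c, text } of placed) {
    const row = grid[r]
    if (row) row[c] = text
  }
  return grid
}

// Render the grid as a Markdown table, using the first row as the header.
function gridToMarkdown(grid: string[][]): string {
  if (grid.length === 0) return ""
  const line = (cells: string[]) => `| ${cells.join(" | ")} |`
  const [header = [], ...body] = grid
  return [line(header), line(header.map(() => "---")), ...body.map(line)].join("\n")
}
```

Clamping spans before marking slots keeps a crafted rowSpan or colSpan from inflating the grid beyond the table's real bounds.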

Credits

Production-tested across 5 Korean government technology projects:

  • School curriculum plans (학교교육과정)
  • Facility inspection reports (사전기획 보고서)
  • Legal document annexes (법률 별표)
  • Municipal newsletters (소식지)
  • Public data extraction tools (공공데이터)

Thousands of real government documents parsed without breaking a sweat.

License

MIT
