Word Document Reader MCP Server
Enables reading and analyzing Word documents with advanced features including table extraction, OCR image analysis, full-text search, and intelligent caching for optimized performance on large documents.
A powerful Word document reading MCP server with table extraction, image OCR analysis, large document optimization, and intelligent caching.
Core Features
1. Document Content Extraction
- Word document (.docx/.doc) text extraction
- Support for mixed Chinese-English documents
- Preserve original formatting and structure
2. Table Extraction
- Automatically identify and extract tables from Word documents
- Convert to structured data format
- Preserve table row/column structure information
- Support complex table parsing
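As a rough illustration of what "structured data" can mean for an extracted table, the sketch below turns rows of cell strings into records keyed by the header row. The tableToRecords name and the rows-of-strings input shape are assumptions made for this example; the README does not specify the server's actual output format.

```javascript
// Hypothetical helper: convert a table (array of rows, each an array of
// cell strings) into records keyed by the first (header) row.
function tableToRecords(rows) {
  if (rows.length === 0) return [];
  const [header, ...body] = rows;
  return body.map((cells) => {
    const record = {};
    header.forEach((name, col) => {
      // Missing trailing cells become empty strings so that every
      // record has the same set of keys.
      record[name.trim()] = (cells[col] ?? '').trim();
    });
    return record;
  });
}

const records = tableToRecords([
  ['Endpoint', 'Method'],
  ['/users', 'GET'],
  ['/users', 'POST'],
]);
console.log(records);
```

Keeping the header names as keys preserves the column structure, while the order of the records preserves the row structure.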
3. Image OCR Analysis
- Extract embedded images from Word documents
- High-precision OCR recognition using Tesseract.js v5
- Support mixed Chinese-English text recognition (95%+ accuracy)
- Intelligent image preprocessing for better recognition
- Support multiple image formats (JPG, PNG, GIF, BMP, WebP)
4. Large Document Optimization
- Automatic detection of large documents (>10MB or >100 pages)
- Worker thread parallel processing, utilizing multi-core CPUs
- Chunked processing to avoid memory overflow
- 60%+ speed improvement
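The detection rule and the chunking idea above can be sketched in a few lines. This is a minimal illustration, not the server's code: isLargeDocument, chunkBuffer, and the field names are assumptions, and the byte-level slicing only demonstrates the chunking pattern (how the server actually splits parsed document content is not described here).

```javascript
// Thresholds mirror the limits quoted above (10 MB, 100 pages).
// The function and field names are illustrative assumptions.
const DEFAULT_LIMITS = { maxFileSize: 10 * 1024 * 1024, maxPages: 100 };

// A document counts as "large" when it exceeds either limit.
function isLargeDocument({ sizeBytes, pageCount }, limits = DEFAULT_LIMITS) {
  return sizeBytes > limits.maxFileSize || pageCount > limits.maxPages;
}

// Lazily yield fixed-size slices of a Buffer. subarray() returns views,
// so no chunk data is copied.
function* chunkBuffer(buffer, chunkSize = 1024 * 1024) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  for (let offset = 0; offset < buffer.length; offset += chunkSize) {
    yield buffer.subarray(offset, offset + chunkSize);
  }
}

const payload = Buffer.alloc(2.5 * 1024 * 1024); // 2.5 MiB of zero bytes
console.log(isLargeDocument({ sizeBytes: payload.length, pageCount: 120 })); // true
console.log([...chunkBuffer(payload)].map((chunk) => chunk.length));
// [ 1048576, 1048576, 524288 ]
```

Using a generator means only one chunk needs to be handled at a time, which is the property that keeps peak memory bounded.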
5. Intelligent Caching System
- File system persistent caching
- Smart cache invalidation based on file modification time
- Cache statistics and management support
- 90%+ speed improvement for repeated document processing
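Invalidation based on modification time can be sketched as follows: a cached result is reused only while the file's mtime matches the one recorded when the result was stored. The getCached helper is hypothetical, and an in-memory Map stands in for the persistent file system cache to keep the example short.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical in-memory stand-in for the persistent cache.
const cache = new Map();

// Return the cached result for filePath, recomputing it whenever the
// file's modification time differs from the one stored with the entry.
function getCached(filePath, compute) {
  const mtimeMs = fs.statSync(filePath).mtimeMs;
  const hit = cache.get(filePath);
  if (hit && hit.mtimeMs === mtimeMs) return hit.value; // still fresh
  const value = compute(filePath);
  cache.set(filePath, { mtimeMs, value });
  return value;
}

// Demo on a temporary file, counting how often the "parser" runs.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache-demo-'));
const file = path.join(dir, 'sample.txt');
fs.writeFileSync(file, 'v1');

let parses = 0;
const parse = (p) => {
  parses += 1;
  return fs.readFileSync(p, 'utf8');
};

getCached(file, parse); // miss: file is parsed
getCached(file, parse); // hit: cached value is reused
fs.writeFileSync(file, 'v2');
// Force a clearly newer mtime so the change is visible on any file system.
fs.utimesSync(file, new Date(), new Date(Date.now() + 5000));
getCached(file, parse); // miss: mtime changed, file is parsed again
console.log(parses); // 2
```

Comparing mtimes is cheap because it needs only a stat call, not a read of the file contents. A content hash would be more robust against clock changes but costs a full read.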
6. Full-text Index Search
- Millisecond-level search with inverted index
- Intelligent Chinese-English word segmentation
- Relevance scoring and sorting
- Support document type filtering
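To make the search features above concrete, here is a small inverted index with term-frequency scoring and document type filtering. It is a sketch under simplifying assumptions: the InvertedIndex class and tokenize function are hypothetical, and Chinese text is split into single characters, which is far cruder than real word segmentation.

```javascript
// Split text into lowercase Latin/digit words and single CJK characters.
// Per-character CJK tokens are a simplification of real segmentation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+|[\u4e00-\u9fff]/g) ?? [];
}

class InvertedIndex {
  constructor() {
    this.postings = new Map(); // token -> Map(docId -> term frequency)
    this.docTypes = new Map(); // docId -> documentType
  }

  add(docId, text, documentType) {
    this.docTypes.set(docId, documentType);
    for (const token of tokenize(text)) {
      if (!this.postings.has(token)) this.postings.set(token, new Map());
      const counts = this.postings.get(token);
      counts.set(docId, (counts.get(docId) ?? 0) + 1);
    }
  }

  // Score each document by the summed term frequency of the query tokens,
  // optionally restricted to one document type, highest score first.
  search(query, { documentType, limit = 10 } = {}) {
    const scores = new Map();
    for (const token of new Set(tokenize(query))) {
      for (const [docId, tf] of this.postings.get(token) ?? []) {
        if (documentType && this.docTypes.get(docId) !== documentType) continue;
        scores.set(docId, (scores.get(docId) ?? 0) + tf);
      }
    }
    return [...scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, limit)
      .map(([docId, score]) => ({ docId, score }));
  }
}

const index = new InvertedIndex();
index.add('a', 'Cache API 缓存 cache', 'api-doc');
index.add('b', 'Search API', 'guide');
console.log(index.search('cache api'));
// [ { docId: 'a', score: 3 }, { docId: 'b', score: 1 } ]
console.log(index.search('api', { documentType: 'guide' }));
// [ { docId: 'b', score: 1 } ]
```

A lookup touches only the postings of the query tokens rather than every document, which is why this structure supports fast search on large collections.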
Installation and Usage
1. Install Dependencies
npm install
2. Start Server
# Start full-featured version
npm start
# Or start basic version (without advanced features)
npm run start:basic
3. Run Tests
# Run all tests
npm test
# Run tests in watch mode
npm run test:watch
# Generate test coverage report
npm run test:coverage
Available Tools
read_word_document
Read and analyze Word documents
{
  "name": "read_word_document",
  "arguments": {
    "filePath": "path/to/document.docx",
    "memoryKey": "my-document",
    "documentType": "api-doc",
    "extractTables": true,
    "extractImages": true,
    "useCache": true,
    "outputDir": "./output"
  }
}
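On the wire, MCP clients send a name/arguments payload like the one above inside a JSON-RPC 2.0 request whose method is tools/call. Most MCP clients build this envelope for you; the buildToolCall helper below is a hypothetical illustration of its shape, not part of this server.

```javascript
// Wrap a tool name and its arguments in a JSON-RPC 2.0 "tools/call"
// request, the envelope MCP uses for tool invocations.
// buildToolCall is a hypothetical helper for illustration.
function buildToolCall(id, name, args = {}) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

const request = buildToolCall(1, 'read_word_document', {
  filePath: 'path/to/document.docx',
  memoryKey: 'my-document',
  extractTables: true,
});
console.log(JSON.stringify(request, null, 2));
```

The same helper works for the other tools in this section by swapping the name and arguments.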
search_documents
Full-text index search
{
  "name": "search_documents",
  "arguments": {
    "query": "search keywords",
    "documentType": "api-doc",
    "limit": 10
  }
}
get_cache_stats
Get cache statistics
{
  "name": "get_cache_stats"
}
clear_cache
Clear the cache. The type argument accepts "all", "document", or "index".
{
  "name": "clear_cache",
  "arguments": {
    "type": "all"
  }
}
list_stored_documents
List stored documents
{
  "name": "list_stored_documents",
  "arguments": {
    "documentType": "api-doc"
  }
}
get_stored_document
Get specific document content
{
  "name": "get_stored_document",
  "arguments": {
    "memoryKey": "document-key"
  }
}
clear_memory
Clear memory content. The memoryKey argument is optional; if it is omitted, all stored content is cleared.
{
  "name": "clear_memory",
  "arguments": {
    "memoryKey": "specific-key"
  }
}
Project Structure
word-doc-mcp/
├── server.js              # Main server file (with all features)
├── server-basic.js        # Basic server (compatibility)
├── package.json           # Project configuration and dependencies
├── config.json            # Server configuration file
├── tests/                 # Test directory
│   ├── setup.js           # Test environment setup
│   ├── unit/              # Unit tests
│   │   └── services/      # Service layer tests
│   ├── integration/       # Integration tests
│   │   ├── tools/         # Tool tests
│   │   └── cache/         # Cache tests
│   └── fixtures/          # Test data
│       ├── documents/     # Test documents
│       └── mock-data.js   # Mock data
├── .cache/                # Cache directory (auto-created)
├── output/                # Output directory (auto-created)
└── node_modules/          # Dependencies
Configuration
Edit the config.json file to customize server behavior:
{
  "processing": {
    "maxFileSize": 10485760,
    "maxPages": 100,
    "chunkSize": 1048576,
    "parallelProcessing": true
  },
  "cache": {
    "enabled": true,
    "defaultTTL": 3600,
    "cacheDirectory": "./.cache"
  },
  "ocr": {
    "enabled": true,
    "languages": ["chi_sim", "eng"]
  }
}
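One common way to consume a file like this is to merge it over built-in defaults so that a partial config.json still yields a complete configuration. The sketch below assumes the values shown above are the defaults and that sections are merged one level deep; mergeConfig and loadConfig are hypothetical names, and the server's real loading logic may differ.

```javascript
const fs = require('node:fs');

// Assumed defaults, copied from the sample config.json above.
const DEFAULTS = {
  processing: {
    maxFileSize: 10485760,
    maxPages: 100,
    chunkSize: 1048576,
    parallelProcessing: true,
  },
  cache: { enabled: true, defaultTTL: 3600, cacheDirectory: './.cache' },
  ocr: { enabled: true, languages: ['chi_sim', 'eng'] },
};

// Merge one level deep: keys present in a section of `overrides`
// replace the corresponding defaults; everything else is kept.
function mergeConfig(overrides = {}) {
  const merged = {};
  for (const section of Object.keys(DEFAULTS)) {
    merged[section] = { ...DEFAULTS[section], ...(overrides[section] ?? {}) };
  }
  return merged;
}

// Read config.json if it exists; otherwise fall back to the defaults.
function loadConfig(configPath = './config.json') {
  if (!fs.existsSync(configPath)) return mergeConfig();
  return mergeConfig(JSON.parse(fs.readFileSync(configPath, 'utf8')));
}

const config = mergeConfig({
  processing: { chunkSize: 2 * 1048576 },
  ocr: { languages: ['eng'] },
});
console.log(config.processing.chunkSize); // 2097152
console.log(config.processing.maxPages); // 100
console.log(config.ocr.languages); // [ 'eng' ]
```

With this approach a config.json that sets only chunkSize leaves every other option at its default.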
Testing
Test Framework
Tests use the Node.js built-in test runner and fall into three categories:
- Unit Tests: Test individual components and functions
- Integration Tests: Test interactions between tools
- End-to-End Tests: Test complete workflows
Running Tests
# Run all tests
npm test
# Run specific test file
node --test tests/unit/services/DocumentIndexer.test.js
# Run integration tests
node --test tests/integration/
# Generate coverage report
npm run test:coverage
Test Coverage
- Functional tests for all MCP tools
- Complete cache system tests
- Error handling and edge cases
- Performance and concurrency tests
- End-to-end workflow tests
Performance Metrics
- Large Document Processing: 60%+ speed improvement (parallel processing)
- Repeated Document Processing: 90%+ speed improvement (caching)
- OCR Recognition Accuracy: 95%+ (image preprocessing)
- Memory Usage Optimization: 40% reduction (streaming processing)
- Search Response Time: <100ms (full-text index)
Security Considerations
- Input file size limits
- File type validation
- Cache data isolation
- Error handling and logging
- Automatic temporary file cleanup
Version Compatibility
Backward Compatibility
- Maintain full compatibility with original API
- Existing tool functionality unchanged
- Optional configuration with reasonable defaults
- Provide basic version to ensure compatibility
System Requirements
Minimum Requirements:
- Node.js 16+
- 4GB RAM
- 1GB disk space
Recommended Configuration:
- Node.js 18+
- 8GB+ RAM
- Multi-core CPU
- SSD storage
Troubleshooting
Common Issues
- Module Installation Failure
  npm cache clean --force
  npm install
- OCR Recognition Failure
  - Ensure sufficient memory (8GB+ recommended)
  - Check supported image formats
  - Review error logs
- Slow Large Document Processing
  - Enable parallel processing
  - Adjust chunkSize configuration
  - Use SSD storage
- Insufficient Memory
  node --max-old-space-size=4096 server.js
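To confirm that a heap-size flag such as --max-old-space-size actually took effect, you can print the limit V8 reports at startup. This is a small diagnostic sketch, not part of the server; heapLimitMiB is a hypothetical helper, and the reported value is usually close to, rather than exactly equal to, the number passed on the command line.

```javascript
const v8 = require('node:v8');

// heap_size_limit is reported in bytes; convert it to MiB for readability.
function heapLimitMiB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`V8 heap limit: ${heapLimitMiB()} MiB`);
```

Running this once with and once without the flag shows whether the process is picking it up.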
Changelog
v2.0.0
- Add table extraction functionality
- Add image OCR analysis
- Implement large document parallel processing
- Add intelligent caching system
- Implement full-text index search
- Complete testing framework
v1.0.0
- Basic Word document reading
- Memory storage management
- Simple search functionality
Contributing
Issues and Pull Requests are welcome!
Development Guidelines
- Fork the project
- Create a feature branch
- Write test cases
- Ensure all tests pass
- Submit a Pull Request
License
MIT License
Quick Start: npm install && npm start