mcp-document-intelligence
Enables AI assistants to analyze, rename, categorize, and organize local documents using OCR, metadata extraction, and batch processing, with preview mode and undo support.
MCP SERVER | OCR | ANALYZE | ORGANIZE
MCP Document Intelligence Server
Model Context Protocol Server with Advanced Batch Processing & Intelligent Document Organization
Built for Perplexity Desktop, Claude Desktop, and other MCP-compatible clients. Supercharge your AI assistant with enterprise-grade document intelligence.
This MCP server analyzes, renames, categorizes, and organizes documents through natural-language requests in your AI client. It scans folders recursively, detects duplicates, extracts metadata from PDFs, DOCX, Pages, images, and text files, and uses OCR for scanned documents. Preview mode, backups, undo support, metadata export, and memory-optimized batch processing make it practical for large personal archives. The result is an AI-driven document workflow that stays local, fast, and automatable.
Features
NEW in v4.6 - Automatic OCR Integration
- Auto-OCR for Scanned PDFs: Automatically falls back to Tesseract OCR when text extraction yields < 50 characters
- Image Support: Process .jpg, .jpeg, .png files with OCR in all tools
- German Language: Pre-configured with German language support (-l deu)
- Enhanced Entity Detection: Vodafone, Telekom, O2, DHL, Amazon now recognized
- Fixed Categorization: Vodafone → 11_Telekommunikation (not insurance!)
- Graceful Fallback: pdftotext → OCR → empty string (30s timeout per file)
- More Document Types: Added Rezept, Kündigung, Mahnung patterns
- Full Archive Scan: New script analyzed 693 files, improved 559 with OCR
- Documentation: Complete OCR-INTEGRATION.md guide
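The fallback chain described above (text extraction first, OCR when fewer than 50 characters come back, empty string as the last resort) can be sketched as follows. This is a minimal illustration, not the server's actual code: `Extractor`, `extractWithFallback`, and the two injected functions are hypothetical names, and the 30-second per-file timeout is left out to keep the sketch synchronous.

```typescript
// Hypothetical extractor type: takes a file path, returns the extracted text.
type Extractor = (filePath: string) => string;

// Threshold taken from the feature list: fewer characters triggers OCR.
const MIN_TEXT_LENGTH = 50;

function extractWithFallback(
  filePath: string,
  textLayer: Extractor, // assumed wrapper around a text extractor such as pdftotext
  ocr: Extractor,       // assumed wrapper around an OCR engine such as Tesseract
): { text: string; source: "text" | "ocr" | "none" } {
  try {
    const text = textLayer(filePath).trim();
    if (text.length >= MIN_TEXT_LENGTH) {
      return { text, source: "text" };
    }
  } catch {
    // A failing text extractor is treated like an empty result.
  }
  try {
    return { text: ocr(filePath).trim(), source: "ocr" };
  } catch {
    // Last step of the chain: give up gracefully with an empty string.
    return { text: "", source: "none" };
  }
}
```

Injecting the extractors keeps the decision logic testable without Tesseract or Poppler installed.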
v4.5 - Advanced Archive Management
- cleanup_old_structure: Removes old folder hierarchies, consolidates into standard categories
- optimize_folder_structure: Deletes empty folders, moves single-file categories to 99_Sonstiges
- intelligent_rename: PDF content analysis for smart naming (extracts companies, document types)
- move_loose_files: Pattern-based categorization for loose files
- Production Ready: Tested with 2,714 files, fully automated workflow
- Complete Documentation: PRODUCTION-SETUP.md and TESTFALL-PERPLEXITY.md
v4.4 - Performance Optimizations
- Memory-Efficient Processing: Generator-based file scanning - no memory overflow
- Batch Processing: Processes 25 files per batch with automatic pauses
- Safety Limits: Configurable limits (500 files/year) prevent system crashes
- Garbage Collection: Explicit memory cleanup between batches
- Progressive Processing: Resume-friendly architecture with state tracking
- Reduced Memory Footprint: 90% reduction vs previous versions
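Generator-based scanning combined with fixed-size batches can be sketched like this. The helper names `inBatches` and `fakeFiles` are hypothetical, and the real server presumably walks the file system rather than generating names; only the batch size of 25 comes from the list above.

```typescript
// Groups any lazy sequence into arrays of at most `batchSize` items.
// Only one batch is held in memory at a time.
function* inBatches<T>(items: Iterable<T>, batchSize = 25): Generator<T[]> {
  let batch: T[] = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // final partial batch
}

// Lazy stand-in for a directory walk (hypothetical file names).
function* fakeFiles(count: number): Generator<string> {
  for (let i = 0; i < count; i++) yield `scan${i}.pdf`;
}

const sizes: number[] = [];
for (const batch of inBatches(fakeFiles(60))) {
  sizes.push(batch.length); // a pause or cleanup step could go here
}
// sizes is now [25, 25, 10]
```

Because both generators are lazy, the full file list never has to exist in memory at once.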
v4.3 - Autonomous Organization
- auto_organize_folder: Analyzes AND organizes folders automatically
- process_downloads: Auto-files Downloads into archive with category detection
- batch_organize_large: Processes >100 files in chunks with resume capability
- Smart Categorization: Auto-detects 10+ categories (Finanzen, Gesundheit, Reisen, etc.)
- State Persistence: Resume interrupted operations from JSON state files
- Decade Detection: Automatically routes to Achziger/Neunziger/Nuller/Zehner/Zwanziger
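Decade detection amounts to mapping a year onto one of the five folder names. In the sketch below, the assignment of names to decades (1980s through 2020s) is an assumption inferred from the names themselves and from the `Zwanziger/2025` path in the process_downloads example; `decadeFolder` is a hypothetical helper, and the folder spellings are copied from the list above.

```typescript
// Assumed mapping from the first year of a decade to its archive folder.
const DECADE_FOLDERS: Record<number, string> = {
  1980: "Achziger",
  1990: "Neunziger",
  2000: "Nuller",
  2010: "Zehner",
  2020: "Zwanziger",
};

// Returns the decade folder for a year, or undefined when the year
// falls outside the decades listed in the README.
function decadeFolder(year: number): string | undefined {
  const decadeStart = Math.floor(year / 10) * 10;
  return DECADE_FOLDERS[decadeStart];
}
```

Returning `undefined` for unknown decades leaves the caller free to pick its own fallback folder.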
v4.2 - Full PDF OCR Support
- Scanned PDF Intelligence: Complete OCR solution for image-based PDFs
- Automatic Detection: Smart fallback from text extraction to OCR (<50 chars triggers OCR)
- Quality Metrics: OCR confidence scores and quality assessment
- Optimized Processing: PDF.js rendering + Tesseract OCR (up to 5 pages)
- German Language Model: Pre-configured for local documents
- Apache-2.0 Licensed: No licensing issues with PDF.js (Mozilla)
v4.1 - Quality & Performance Enhancements
- Comprehensive Testing: 100 automated test cases with 99% pass rate
- Performance Metrics: Real-time processing stats and throughput reporting
- Enhanced Validation: File size, name length, and type validation
- Better Encoding Detection: Automatic UTF-8/Latin-1 switching with reporting
- Optimized Processing: Average <100ms per file, batch <2000ms
- Robust Error Handling: Structured errors with actionable suggestions
v4.0 - Enterprise Features
- Recursive Scanning: Deep folder analysis up to 10 levels
- Duplicate Detection: SHA256-based file deduplication
- Preview Mode: Dry-run operations before execution
- Backup & Undo: Automatic backups with one-click restore
- Metadata Export: Export analysis results to JSON/CSV
- Smart Filters: Filter by file type and keywords
- Copy Mode: Copy files instead of moving them
- Configurable Rules: Custom folder organization patterns
- OCR Quality Feedback: Confidence scores for scanned documents
- Detailed Statistics: Comprehensive operation summaries
Batch Document Processing
- Folder Scanning: Analyze entire folders recursively in one operation
- Batch Organization: Rename and move multiple files automatically
- Smart Folder Structure: Auto-generate organized folder hierarchies
- Workflow Automation: Scan → Analyze → Preview → Organize → Undo
Multi-Format Document Intelligence
- Text Extraction: Extract text from PDF, DOCX, Pages, Images, TXT
- Full PDF OCR: PDF.js + Tesseract.js for scanned PDFs (automatic fallback)
- OCR Quality Scoring: Confidence metrics and quality assessment for all OCR operations
- Multi-Encoding Support: Automatic detection and handling of UTF-8, Latin-1/ISO-8859-1
- Robust Parsing: Handles null-bytes, special characters, and unusual file names
- Smart Filename Suggestions: Automatically extracts:
  - Scanner timestamps (preserves existing 2024-01-24_14-30-45 format)
  - Document dates (DD.MM.YYYY, YYYY-MM-DD)
  - Reference numbers (Invoice#, Customer#, Order#, Contract#)
  - Keywords (Invoice, Contract, Company names)
Quick Start
System Requirements
For full PDF-OCR support, install these system tools:
# macOS (via Homebrew)
brew install tesseract tesseract-lang # OCR engine with all languages
brew install poppler # PDF rendering tools (pdftoppm)
Prerequisites
You need one of these AI desktop clients:
- Perplexity Desktop App (macOS/Windows)
- Claude Desktop App (macOS/Windows)
Both support the Model Context Protocol (MCP) for extending AI capabilities with custom tools.
Installation
git clone https://github.com/AndreasDietzel/mcp-document-intelligence.git
cd mcp-document-intelligence
npm install
npm run build
Configuration
Add to your MCP client configuration:
For Perplexity Desktop:
Location: ~/Library/Application Support/Perplexity/perplexity-config.json (macOS)
{
"mcpServers": {
"document-intelligence": {
"command": "node",
"args": ["/path/to/mcp-document-intelligence/build/index.js"],
"env": {
"NODE_ENV": "production"
}
}
}
}
For Claude Desktop:
Location: ~/Library/Application Support/Claude/claude_desktop_config.json (macOS)
{
"mcpServers": {
"document-intelligence": {
"command": "node",
"args": ["/path/to/mcp-document-intelligence/build/index.js"],
"env": {
"NODE_ENV": "production"
}
}
}
}
Restart your AI client after updating the configuration.
MCP Filesystem Integration
This server works best alongside the official MCP Filesystem Server, which provides file browsing and management capabilities to your AI assistant.
Recommended Setup for Perplexity/Claude:
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": [
"-y",
"@modelcontextprotocol/server-filesystem",
"/path/to/your/documents"
]
},
"document-intelligence": {
"command": "node",
"args": ["/path/to/mcp-document-intelligence/build/index.js"]
}
}
}
With both servers configured, you can:
- Ask your AI to find documents in your folders (filesystem server)
- Analyze and organize them automatically (document-intelligence server)
- Handle the complete workflow through natural conversation with Perplexity/Claude
Available Tools
analyze_document
Analyzes a single document and suggests an intelligent filename.
Input:
{
"filePath": "/path/to/document.pdf"
}
Output Example:
{
"originalFilename": "20240124_143045_scan.pdf",
"suggestedFilename": "20240124_143045_RE-2024-1234_rechnung_telekom.pdf",
"documentDate": "24.01.2024",
"references": ["RE-2024-1234"],
"keywords": ["rechnung", "telekom"],
"scannerDatePreserved": true,
"textLength": 2450,
"preview": "Rechnung Nr. RE-2024-1234..."
}
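The output example suggests how a filename is assembled: the scanner timestamp is kept, then reference numbers and keywords are appended. A minimal sketch under that assumption follows. `Analysis` and `suggestFilename` are hypothetical names, the field names are borrowed from the example above, and falling back to the document date when no scanner timestamp exists is inferred from the `2024-01-24_RE-123_rechnung_telekom.pdf` example elsewhere in this README.

```typescript
// Subset of the analyze_document output used for naming (field names from the example).
interface Analysis {
  originalFilename: string;
  documentDate?: string; // "DD.MM.YYYY"
  references: string[];
  keywords: string[];
}

function suggestFilename(a: Analysis): string {
  const dot = a.originalFilename.lastIndexOf(".");
  const ext = dot >= 0 ? a.originalFilename.slice(dot) : "";

  // Preserve an existing scanner timestamp at the start of the name.
  const stamp = a.originalFilename.match(
    /^(\d{8}_\d{6}|\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})/,
  );
  let prefix = "";
  if (stamp) {
    prefix = stamp[1];
  } else if (a.documentDate) {
    // Otherwise fall back to the document date, rewritten as YYYY-MM-DD.
    const m = a.documentDate.match(/^(\d{2})\.(\d{2})\.(\d{4})$/);
    if (m) prefix = `${m[3]}-${m[2]}-${m[1]}`;
  }

  const parts = [prefix, ...a.references, ...a.keywords].filter((p) => p.length > 0);
  return parts.join("_") + ext;
}
```

Fed the values from the example above, this yields `20240124_143045_RE-2024-1234_rechnung_telekom.pdf`.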
analyze_folder (Enhanced in v4.0)
Analyzes ALL documents in a folder (batch processing with recursive scanning, duplicate detection, and filtering).
Input:
{
"folderPath": "/path/to/folder",
"recursive": true,
"fileTypes": ["invoice", "contract"],
"keywords": ["telekom", "vodafone"]
}
Output:
{
"folderPath": "/path/to/folder",
"totalFiles": 15,
"duplicateGroups": [
{
"hash": "abc123...",
"count": 3,
"files": ["doc1.pdf", "doc1_copy.pdf", "duplicate.pdf"]
}
],
"documents": [
{
"originalPath": "...",
"suggestedFilename": "...",
"ocrQuality": "good",
"confidence": 0.95,
"metadata": {...}
}
]
}
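The `duplicateGroups` field can be produced by hashing file contents with SHA-256 and grouping names that share a hash. The sketch below uses Node's built-in `node:crypto` module and works on in-memory bytes so that it stays self-contained. `findDuplicateGroups` is a hypothetical helper; how the server actually reads and hashes files is not shown in this README.

```typescript
import { createHash } from "node:crypto";

// Shape taken from the duplicateGroups entries in the output example.
interface DuplicateGroup {
  hash: string;
  count: number;
  files: string[];
}

// Maps file name -> file bytes; returns only hashes shared by 2+ files.
function findDuplicateGroups(contents: Map<string, Uint8Array>): DuplicateGroup[] {
  const byHash = new Map<string, string[]>();
  for (const [name, data] of contents) {
    const hash = createHash("sha256").update(data).digest("hex");
    const names = byHash.get(hash) ?? [];
    names.push(name);
    byHash.set(hash, names);
  }
  const groups: DuplicateGroup[] = [];
  for (const [hash, files] of byHash) {
    if (files.length > 1) groups.push({ hash, count: files.length, files });
  }
  return groups;
}
```

For large archives a streaming hash would avoid loading whole files into memory; that refinement is omitted here.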
suggest_folder_structure (NEW in v3.0)
Suggests intelligent folder organization based on analyzed documents.
Input:
{
"documents": [ /* array from analyze_folder */ ]
}
auto_organize_folder (NEW in v4.3)
Analyzes AND organizes a folder automatically - combines analysis + rename/move in one step.
Input:
{
"sourcePath": "/path/to/source",
"archivePath": "/path/to/archive",
"dryRun": false,
"createCategories": true,
"stateFile": "/tmp/state.json"
}
Output:
{
"total": 150,
"processed": 147,
"moved": 145,
"categorized": {
"01_Finanzen": 45,
"03_Gesundheit": 12,
"06_Reisen": 23,
"99_Sonstiges": 65
},
"errors": [{"file": "...", "error": "..."}]
}
process_downloads (NEW in v4.3)
Scans the Downloads folder and automatically files documents into the archive, detecting year and category.
Input:
{
"downloadsPath": "~/Downloads",
"archiveBasePath": "/path/to/archive",
"autoMove": false,
"maxFiles": 50
}
Output:
{
"scanned": 23,
"suggestions": [
{
"from": "~/Downloads/Rechnung.pdf",
"to": "/archive/Zwanziger/2025/01_Finanzen/2025-03-15_Rechnung_Telekom.pdf",
"year": 2025,
"category": "Finanzen"
}
],
"filed": [],
"errors": []
}
batch_organize_large (NEW in v4.3)
Processes large folders (>100 files) in chunks with resume capability.
Input:
{
"folderPath": "/path/to/large/folder",
"targetArchivePath": "/path/to/archive",
"chunkSize": 50,
"stateFilePath": "/tmp/batch-state.json"
}
Output:
{
"chunkCompleted": true,
"totalFiles": 500,
"processedFiles": 50,
"successCount": 48,
"errorCount": 2,
"percentComplete": 10,
"nextChunkExists": true,
"stateFilePath": "/tmp/batch-state.json"
}
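The chunk-and-resume behaviour can be illustrated with a pure function that processes one chunk and returns the counters a caller would write to the JSON state file. This is a sketch under assumptions: `BatchState`, `processNextChunk`, and the `handle` callback are hypothetical, and only the output field names come from the example above.

```typescript
// Counters a state file would need in order to resume (assumed shape).
interface BatchState {
  processedFiles: number;
  successCount: number;
  errorCount: number;
}

interface ChunkResult extends BatchState {
  chunkCompleted: boolean;
  totalFiles: number;
  percentComplete: number;
  nextChunkExists: boolean;
}

function processNextChunk(
  files: string[],
  state: BatchState,
  chunkSize: number,
  handle: (file: string) => void, // hypothetical per-file handler; throws on failure
): ChunkResult {
  // Resume where the previous chunk stopped.
  const chunk = files.slice(state.processedFiles, state.processedFiles + chunkSize);
  let { successCount, errorCount } = state;
  for (const file of chunk) {
    try {
      handle(file);
      successCount++;
    } catch {
      errorCount++; // one failing file does not abort the chunk
    }
  }
  const processedFiles = state.processedFiles + chunk.length;
  return {
    chunkCompleted: true,
    totalFiles: files.length,
    processedFiles,
    successCount,
    errorCount,
    percentComplete:
      files.length === 0 ? 100 : Math.round((processedFiles / files.length) * 100),
    nextChunkExists: processedFiles < files.length,
  };
}
```

Passing the previous result back in as `state` continues with the next chunk, which mirrors reloading the state file after an interruption.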
Output of suggest_folder_structure (described above, before auto_organize_folder):
{
"structure": {
"2024": {
"Rechnungen": ["Telekom", "Vodafone"],
"Vertraege": ["..."]
}
},
"assignments": [
{
"originalPath": "/path/scan001.pdf",
"targetFolder": "2024/Rechnungen/Telekom",
"newFilename": "2024-01-24_RE-123_rechnung_telekom.pdf"
}
]
}
batch_organize (Enhanced in v4.0)
Executes batch renaming and moving/copying of files with automatic backup.
Input:
{
"baseFolder": "/path/to/organized",
"mode": "move",
"createBackup": true,
"operations": [
{
"originalPath": "/path/scan001.pdf",
"targetFolder": "2024/Rechnungen/Telekom",
"newFilename": "2024-01-24_RE-123_rechnung_telekom.pdf"
}
]
}
Output:
{
"success": true,
"mode": "move",
"filesProcessed": 15,
"filesFailed": 0,
"foldersCreated": 5,
"backupCreated": true,
"backupPath": "/path/.backup_2024-01-24T10-30-00.json",
"results": [...]
}
preview_organization (NEW in v4.0)
Shows a dry-run preview of what would happen without making changes.
Input:
{
"baseFolder": "/path/to/organized",
"operations": [ /* same as batch_organize */ ]
}
Output:
{
"preview": [
{
"action": "move",
"from": "/path/scan001.pdf",
"to": "/path/organized/2024/Rechnungen/Telekom/2024-01-24_RE-123.pdf",
"status": "ok"
}
],
"warnings": [],
"stats": {
"totalFiles": 15,
"foldersToCreate": ["2024/Rechnungen/Telekom"],
"conflicts": 0,
"missingFiles": 0
},
"safeToExecute": true
}
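A dry run like this can be computed without touching the disk if the existence check is injected. The sketch below is illustrative only: `previewOrganization` and the `exists` callback are hypothetical, and the status values `"conflict"` and `"missing"` are assumptions, since the example above only shows `"ok"`.

```typescript
interface Operation {
  originalPath: string;
  targetFolder: string;
  newFilename: string;
}

interface PreviewEntry {
  action: "move";
  from: string;
  to: string;
  status: "ok" | "conflict" | "missing"; // only "ok" appears in the README
}

function previewOrganization(
  baseFolder: string,
  operations: Operation[],
  exists: (path: string) => boolean, // injected so nothing is read or written
) {
  const seenTargets = new Set<string>();
  const foldersToCreate = new Set<string>();
  let conflicts = 0;
  let missingFiles = 0;

  const preview = operations.map((op): PreviewEntry => {
    const to = `${baseFolder}/${op.targetFolder}/${op.newFilename}`;
    let status: PreviewEntry["status"] = "ok";
    if (!exists(op.originalPath)) {
      status = "missing";
      missingFiles++;
    } else if (exists(to) || seenTargets.has(to)) {
      status = "conflict"; // target already exists or is claimed by an earlier operation
      conflicts++;
    }
    seenTargets.add(to);
    if (!exists(`${baseFolder}/${op.targetFolder}`)) foldersToCreate.add(op.targetFolder);
    return { action: "move", from: op.originalPath, to, status };
  });

  return {
    preview,
    stats: {
      totalFiles: operations.length,
      foldersToCreate: Array.from(foldersToCreate),
      conflicts,
      missingFiles,
    },
    safeToExecute: conflicts === 0 && missingFiles === 0,
  };
}
```

In real use, `exists` could be backed by a file system check such as `fs.existsSync`.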
undo_last_organization (NEW in v4.0)
Restores the last organization operation from automatic backup.
Input:
{
"baseFolder": "/path/to/organized"
}
Output:
{
"success": true,
"restored": 15,
"failed": 0,
"backupFile": "/path/.backup_2024-01-24T10-30-00.json"
}
export_metadata (NEW in v4.0)
Exports analyzed document metadata to JSON or CSV format.
Input:
{
"documents": [ /* array from analyze_folder */ ],
"format": "csv"
}
Output (CSV):
Filename,Path,Date,References,Keywords,OCR Quality,Confidence,Type
scan001.pdf,/path/scan001.pdf,24.01.2024,RE-2024-1234,rechnung;telekom,good,0.95,invoice
...
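A CSV export with this header can be sketched as below. The column order and the semicolon-joined keywords follow the sample row above. The `DocMeta` field names and the quoting rule (quote fields containing commas, quotes, or line breaks) are assumptions, not taken from the server's source.

```typescript
// Hypothetical per-document record; columns mirror the CSV header above.
interface DocMeta {
  filename: string;
  path: string;
  date: string;
  references: string[];
  keywords: string[];
  ocrQuality: string;
  confidence: number;
  type: string;
}

const CSV_HEADER = "Filename,Path,Date,References,Keywords,OCR Quality,Confidence,Type";

// Quotes a field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(docs: DocMeta[]): string {
  const rows = docs.map((d) =>
    [
      d.filename,
      d.path,
      d.date,
      d.references.join(";"),
      d.keywords.join(";"),
      d.ocrQuality,
      String(d.confidence),
      d.type,
    ]
      .map(csvField)
      .join(","),
  );
  return [CSV_HEADER, ...rows].join("\n");
}
```

Joining list values with semicolons keeps each document on a single row without needing quoted fields in the common case.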
Use Cases
Single Document Analysis
analyze_document with filePath: "/path/to/scanned_invoice.pdf"
→ Extracts invoice number, date, company name and suggests:
2024-01-24_INV-2024-001_rechnung_telekom.pdf
Advanced Batch Organization (v4.0)
Example conversation with Perplexity or Claude:
You: "Analyze all documents recursively in my 2026 folder, find duplicates"
AI (using analyze_folder with recursive=true):
✓ Scanned 10 levels deep
✓ Found 150 documents
✓ Detected 12 duplicates (3 groups)
✓ Extracted: dates, invoice numbers, companies
✓ OCR quality: 95% confidence average
AI (using suggest_folder_structure):
✓ Proposes: 2026/Rechnungen/Telekom, 2026/Vertraege/Vodafone, etc.
✓ Shows: Complete list of file renames and target folders
AI (using preview_organization):
✓ Preview: 150 files will be moved
✓ Folders to create: 8
✓ Conflicts: 0
✓ Safe to execute: YES
AI: "I found 150 documents (12 duplicates). Should I organize them into
2026/Rechnungen, 2026/Vertraege with smart filenames?"
You: "Yes, but copy instead of moving"
AI (using batch_organize with mode="copy", createBackup=true):
✓ Copies all files (originals preserved)
✓ Creates folder structure
✓ Backup created for undo
✓ Processes everything automatically
AI: "Done! Organized 150 documents, created 8 folders, backup saved.
150 files processed, 0 failed. Type 'undo' to revert."
You: "Export the metadata as CSV"
AI (using export_metadata with format="csv"):
✓ Exports all document metadata
✓ Includes: filename, date, references, keywords, OCR quality
AI: "CSV exported with all metadata for 150 documents."
Complete Workflow v4.0:
- Scanner saves to "Inbox" folder
- Tell Perplexity/Claude to analyze recursively + find duplicates
- Preview changes before execution
- Confirm with copy or move mode
- Files auto-organized with automatic backup
- Export metadata for records
- Undo anytime if needed
Workflow Automation
- Before: Manual sorting of 100+ scanned documents
- After: One command → Preview → Organization in seconds
- Safety: Automatic backups, preview mode, undo function
- Perfect for: Tax documents, invoices, contracts, receipts, archives
Technical Details
Dependencies
- pdf-parse: PDF text extraction
- mammoth: DOCX document processing
- adm-zip: Pages document extraction
- tesseract.js: OCR for scanned documents and images
- @modelcontextprotocol/sdk: MCP protocol implementation
File Structure
mcp-document-intelligence/
├── src/
│   └── index.ts      # Main server implementation
├── build/            # Compiled output
├── package.json
├── tsconfig.json
└── README.md
Filename Pattern Recognition
The analyzer recognizes:
- Scanner timestamps: YYYY-MM-DD_HH-MM-SS or YYYYMMDD_HHMMSS
- Document dates: DD.MM.YYYY, YYYY-MM-DD
- Reference patterns:
  - Rechnungs-Nr: XXX / Invoice: XXX
  - Kunden-Nr: XXX / Customer: XXX
  - Bestell-Nr: XXX / Order: XXX
  - Vertrag-Nr: XXX / Contract: XXX
- Keywords: Invoice, Contract, Offer, Order, common company names
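The patterns listed above translate naturally into regular expressions. The expressions below are one possible reading of those patterns, not the analyzer's actual rules; `recognize` and the constant names are hypothetical.

```typescript
// Scanner timestamps: YYYY-MM-DD_HH-MM-SS or YYYYMMDD_HHMMSS
const SCANNER_STAMP = /(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}|\d{8}_\d{6})/;
// Document dates: DD.MM.YYYY or YYYY-MM-DD
const DOCUMENT_DATE = /\b(\d{2}\.\d{2}\.\d{4}|\d{4}-\d{2}-\d{2})\b/;
// Reference labels followed by a colon and an alphanumeric value
const REFERENCE =
  /\b(Rechnungs-Nr|Invoice|Kunden-Nr|Customer|Bestell-Nr|Order|Vertrag-Nr|Contract)\.?:\s*([A-Za-z0-9][A-Za-z0-9\/-]*)/;

interface Recognized {
  scannerTimestamp?: string;
  documentDate?: string;
  references: { label: string; value: string }[];
}

function recognize(filename: string, text: string): Recognized {
  const stamp = filename.match(SCANNER_STAMP);
  const date = text.match(DOCUMENT_DATE);

  // A fresh global copy of the pattern, so lastIndex state is never shared.
  const pattern = new RegExp(REFERENCE.source, "g");
  const references: { label: string; value: string }[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    references.push({ label: m[1], value: m[2] });
  }

  return {
    scannerTimestamp: stamp ? stamp[1] : undefined,
    documentDate: date ? date[1] : undefined,
    references,
  };
}
```

Real documents vary more than these expressions allow (for example "Nr." with a period and no colon), so a production matcher would need a broader set of patterns.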
Folder Structure Generation
Automatically groups documents by:
- Year: Extracted from document date
- Category: Rechnungen, Verträge, Angebote, Mahnungen, etc.
- Company: Telekom, Vodafone, Amazon, PayPal, Banks, etc.
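The three grouping levels combine into a relative path such as `2024/Rechnungen/Telekom`. A minimal sketch, assuming the category and company have already been detected: `ClassifiedDoc`, `targetFolder`, and the `"undated"` fallback name are hypothetical.

```typescript
// Hypothetical input: a document whose category and company are already known.
interface ClassifiedDoc {
  documentDate: string; // "DD.MM.YYYY" or "YYYY-MM-DD"
  category: string;
  company?: string;
}

function targetFolder(doc: ClassifiedDoc): string {
  // Pull the year out of either supported date format.
  const m = doc.documentDate.match(/^(?:\d{2}\.\d{2}\.(\d{4})|(\d{4})-\d{2}-\d{2})$/);
  const year = m ? (m[1] ?? m[2]) : "undated"; // "undated" is an assumed fallback
  const parts = [year, doc.category];
  if (doc.company) parts.push(doc.company);
  return parts.join("/");
}
```

Omitting the company level when none was detected keeps the hierarchy shallow for uncategorized senders.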
Privacy & Security
- All data stays local - No external API calls for personal data
- OCR processing on-device - Tesseract.js runs locally
- No data transmission - All processing happens locally
- No logging of document content
Encoding & International Support
- Automatic Encoding Detection - UTF-8 and Latin-1/ISO-8859-1
- International Characters - Full Unicode support (日本語, 中文, العربية, עברית)
- German Umlauts - Native support for äöüÄÖÜß
- Special Characters - Handles pipes, colons, quotes in filenames
- Null-Byte Handling - Automatically cleans corrupted files
- Encoding Info - Reports detected encoding in analysis results
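One common way to implement UTF-8/Latin-1 detection is to attempt a strict UTF-8 decode and fall back to Latin-1 when the bytes are invalid. The sketch below uses the standard `TextDecoder` API and is an assumption about the approach, not the server's code. `decodeText` is a hypothetical name, and support for the `"latin1"` label depends on the runtime (it is available in Node.js builds with full ICU, where it is treated as windows-1252).

```typescript
function decodeText(bytes: Uint8Array): { text: string; encoding: "utf-8" | "latin1" } {
  let text: string;
  let encoding: "utf-8" | "latin1";
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently inserting replacement characters.
    text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    encoding = "utf-8";
  } catch {
    // Every byte is valid in Latin-1, so this decode cannot fail.
    text = new TextDecoder("latin1").decode(bytes);
    encoding = "latin1";
  }
  // Strip null bytes, which sometimes appear in corrupted text files.
  return { text: text.replace(/\u0000/g, ""), encoding };
}
```

Returning the detected encoding alongside the text matches the "Encoding Info" item in the list above.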
Quality Assurance
This project maintains high quality standards with comprehensive testing:
- 100 Automated Tests covering all functionality
- 99% Test Pass Rate (99/100 tests passing)
- ISO 25010 Compliant - Quality characteristics validated
- Performance Benchmarks - <100ms per file average
- Security Tested - Path traversal and data privacy verified
See TEST-DOCUMENTATION.md for detailed test coverage and results.
Roadmap
- [x] Multi-format document support (PDF, DOCX, Pages, Images, TXT)
- [x] Batch processing support
- [x] Auto-filing to folders based on content
- [x] Recursive folder scanning
- [x] Duplicate detection with SHA256
- [x] Preview mode (dry-run)
- [x] Backup & Undo functionality
- [x] Metadata export (JSON/CSV)
- [x] Copy vs Move modes
- [x] OCR quality feedback
- [ ] Configurable naming templates
- [ ] Custom reference number patterns
- [ ] Excel/CSV document support
- [ ] Integration with document management systems
- [ ] Machine learning for improved categorization
License
MIT License - see LICENSE file
Acknowledgments
- Built with Model Context Protocol SDK
- Inspired by and based on concepts from MCP Filesystem Server
- PDF parsing by pdf-parse
- OCR by Tesseract.js
Made for intelligent document workflows