mcp-document-intelligence

Enables AI assistants to analyze, rename, categorize, and organize local documents using OCR, metadata extraction, and batch processing, with preview mode and undo support.

+--------------------------------------------------------------------------------+
|                                                                                |
|  __  __  ____ ____        ____                                _                |
| |  \/  |/ ___|  _ \      |  _ \  ___   ___ _   _ _ __ ___   | |_ ___          |
| | |\/| | |   | |_) |_____| | | |/ _ \ / __| | | | '_ ` _ \  | __/ _ \         |
| | |  | | |___|  __/_____| |_| | (_) | (__| |_| | | | | | | | ||  __/         |
| |_|  |_|\____|_|        |____/ \___/ \___|\__,_|_| |_| |_|  \__\___|         |
|                                                                                |
|      ____       _       _ _ _                       _                          |
|     |_ _|_ __  | |_ ___| | (_) __ _  ___ _ __   ___| |                         |
|      | || '_ \ | __/ _ \ | | |/ _` |/ _ \ '_ \ / __| |                         |
|      | || | | || ||  __/ | | | (_| |  __/ | | | (__| |                         |
|     |___|_| |_| \__\___|_|_|_|\__, |\___|_| |_|\___|_|                         |
|                                |___/                                           |
|                                                                                |
|              MCP SERVER  |  OCR  |  ANALYZE  |  ORGANIZE                       |
|                                                                                |
+--------------------------------------------------------------------------------+

MCP Document Intelligence Server

Model Context Protocol Server with Advanced Batch Processing & Intelligent Document Organization

🎯 Built for Perplexity Desktop, Claude Desktop, and other MCP-compatible clients – Supercharge your AI assistant with enterprise-grade document intelligence.

This MCP server analyzes, renames, categorizes, and organizes documents through natural-language requests in your AI client. It scans folders recursively, detects duplicates, extracts metadata from PDFs, DOCX, Pages, images, and text files, and uses OCR for scanned documents. Preview mode, backups, undo support, metadata export, and memory-optimized batch processing make it practical for large personal archives. The result is an AI-driven document workflow that stays local, fast, and automatable.

🎯 Features

🔍 NEW in v4.6 - Automatic OCR Integration

📸 Auto-OCR for Scanned PDFs: Automatically falls back to Tesseract OCR when text extraction yields < 50 characters
🖼️ Image Support: Process .jpg, .jpeg, .png files with OCR in all tools
🇩🇪 German Language: Pre-configured with German language support (-l deu)
🏢 Enhanced Entity Detection: Vodafone, Telekom, O2, DHL, Amazon now recognized
📁 Fixed Categorization: Vodafone → 11_Telekommunikation (not insurance!)
⚡ Graceful Fallback: pdftotext → OCR → empty string (30s timeout per file)
📋 More Document Types: Added Rezept, Kündigung, Mahnung patterns
🌍 Full Archive Scan: New script analyzed 693 files, improved 559 with OCR
📚 Documentation: Complete OCR-INTEGRATION.md guide
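The fallback chain described above (text extraction, then OCR, then an empty string) might look roughly like the sketch below. The `Extractor` type and all function names are hypothetical and not taken from the server's source; the 50-character threshold and 30-second timeout come from the feature list. Applying the timeout to the OCR step only, and trimming whitespace before counting characters, are assumptions.

```typescript
// Hypothetical extractor signature: resolves to the text found in a file.
type Extractor = (filePath: string) => Promise<string>;

const MIN_TEXT_CHARS = 50; // threshold from the feature list
const OCR_TIMEOUT_MS = 30_000; // per-file timeout from the feature list

// True when direct extraction produced too little text to be useful.
function needsOcr(text: string, minChars: number = MIN_TEXT_CHARS): boolean {
  return text.trim().length < minChars;
}

// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Direct extraction first, OCR second, empty string as the last resort.
async function extractText(
  filePath: string,
  extractDirect: Extractor,
  extractOcr: Extractor,
  timeoutMs: number = OCR_TIMEOUT_MS,
): Promise<string> {
  let text = "";
  try {
    text = await extractDirect(filePath);
  } catch {
    text = ""; // a failed direct extraction still gets an OCR attempt
  }
  if (!needsOcr(text)) return text;
  try {
    return await withTimeout(extractOcr(filePath), timeoutMs);
  } catch {
    return ""; // graceful fallback: never throw to the caller
  }
}
```

Passing the two extractors in as parameters keeps the fallback logic independent of whichever PDF and OCR tools are actually installed.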

⚡ v4.5 - Advanced Archive Management

🧹 cleanup_old_structure: Removes old folder hierarchies, consolidates into standard categories
📁 optimize_folder_structure: Deletes empty folders, moves single-file categories to 99_Sonstiges
🤖 intelligent_rename: PDF content analysis for smart naming (extracts companies, document types)
📋 move_loose_files: Pattern-based categorization for loose files
🎯 Production Ready: Tested with 2,714 files, fully automated workflow
📚 Complete Documentation: PRODUCTION-SETUP.md and TESTFALL-PERPLEXITY.md

⚡ v4.4 - Performance Optimizations

🎯 Memory-Efficient Processing: Generator-based file scanning - no memory overflow
📊 Batch Processing: Processes 25 files per batch with automatic pauses
🛡️ Safety Limits: Configurable limits (500 files/year) prevent system crashes
🧹 Garbage Collection: Explicit memory cleanup between batches
⏸️ Progressive Processing: Resume-friendly architecture with state tracking
📈 Reduced Memory Footprint: 90% reduction vs previous versions
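As a rough illustration of generator-based batching with a safety limit, consider the sketch below. `inBatches` and `take` are hypothetical helper names, not functions from the server; the batch size of 25 and the limit of 500 mirror the numbers quoted above.

```typescript
// Lazily groups any iterable into arrays of at most `size` items.
function* inBatches<T>(items: Iterable<T>, size: number): Generator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let batch: T[] = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = []; // drop the reference so the finished batch can be collected
    }
  }
  if (batch.length > 0) yield batch;
}

// Stops after `limit` items, mirroring a configurable safety limit.
function* take<T>(items: Iterable<T>, limit: number): Generator<T> {
  if (limit <= 0) return;
  let seen = 0;
  for (const item of items) {
    yield item;
    seen += 1;
    if (seen >= limit) return;
  }
}

// Usage: 60 paths, capped at 500, processed 25 at a time.
const examplePaths = Array.from({ length: 60 }, (_, i) => `file-${i}.pdf`);
const batchSizes = Array.from(inBatches(take(examplePaths, 500), 25), (b) => b.length);
// batchSizes is [25, 25, 10]
```

Because both helpers are generators, only one batch is held in memory at a time, regardless of how many files the source iterable yields.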

🤖 v4.3 - Autonomous Organization

🔄 auto_organize_folder: Analyzes AND organizes folders automatically
📥 process_downloads: Auto-files Downloads into archive with category detection
🧩 batch_organize_large: Processes >100 files in chunks with resume capability
📊 Smart Categorization: Auto-detects 10+ categories (Finanzen, Gesundheit, Reisen, etc.)
💾 State Persistence: Resume interrupted operations from JSON state files
🎯 Decade Detection: Automatically routes to Achziger/Neunziger/Nuller/Zehner/Zwanziger
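Decade routing could be sketched as a simple lookup. The folder names are the ones listed above; the mapping of each name to a range of years is an assumption based on the German decade names, and `decadePath` is a hypothetical helper.

```typescript
// Folder names as listed in the feature description; the year ranges are assumed.
const DECADE_FOLDERS: Record<number, string> = {
  1980: "Achziger",
  1990: "Neunziger",
  2000: "Nuller",
  2010: "Zehner",
  2020: "Zwanziger",
};

// Returns e.g. "Zwanziger/2025", or undefined for years outside the known decades.
function decadePath(year: number): string | undefined {
  const decadeStart = Math.floor(year / 10) * 10;
  const folder = DECADE_FOLDERS[decadeStart];
  return folder === undefined ? undefined : `${folder}/${year}`;
}
```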

v4.2 - Full PDF OCR Support

📄 Scanned PDF Intelligence: Complete OCR solution for image-based PDFs
🤖 Automatic Detection: Smart fallback from text extraction to OCR (<50 chars triggers OCR)
📊 Quality Metrics: OCR confidence scores and quality assessment
⚡ Optimized Processing: PDF.js rendering + Tesseract OCR (up to 5 pages)
🌍 German Language Model: Pre-configured for local documents
📜 Apache-2.0 Licensed: No licensing issues with PDF.js (Mozilla)

🚀 v4.1 - Quality & Performance Enhancements

🧪 Comprehensive Testing: 100 automated test cases with 99% pass rate
📊 Performance Metrics: Real-time processing stats and throughput reporting
🔍 Enhanced Validation: File size, name length, and type validation
🌐 Better Encoding Detection: Automatic UTF-8/Latin-1 switching with reporting
⚡ Optimized Processing: Average <100ms per file, batch <2000ms
🛡️ Robust Error Handling: Structured errors with actionable suggestions

🚀 v4.0 - Enterprise Features

🔍 Recursive Scanning: Deep folder analysis up to 10 levels
👥 Duplicate Detection: SHA256-based file deduplication
👁️ Preview Mode: Dry-run operations before execution
⏮️ Backup & Undo: Automatic backups with one-click restore
📊 Metadata Export: Export analysis results to JSON/CSV
🎯 Smart Filters: Filter by file type and keywords
📋 Copy Mode: Copy files instead of moving them
⚙️ Configurable Rules: Custom folder organization patterns
🎨 OCR Quality Feedback: Confidence scores for scanned documents
📈 Detailed Statistics: Comprehensive operation summaries
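SHA256-based duplicate detection amounts to grouping files by content hash. The sketch below uses Node's built-in `crypto` module and works on in-memory byte arrays to stay self-contained. `sha256Hex` and `findDuplicateGroups` are hypothetical names; a real implementation would more likely feed file streams into the hash chunk by chunk instead of loading whole files.

```typescript
import { createHash } from "node:crypto";

// Same shape as the `duplicateGroups` entries in the analyze_folder output.
interface DuplicateGroup {
  hash: string;
  count: number;
  files: string[];
}

// Hex-encoded SHA-256 digest of a file's bytes.
function sha256Hex(content: Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Groups file names by content hash and keeps only hashes seen more than once.
function findDuplicateGroups(files: Map<string, Uint8Array>): DuplicateGroup[] {
  const byHash = new Map<string, string[]>();
  for (const [name, content] of files) {
    const hash = sha256Hex(content);
    const names = byHash.get(hash) ?? [];
    names.push(name);
    byHash.set(hash, names);
  }
  const groups: DuplicateGroup[] = [];
  for (const [hash, names] of byHash) {
    if (names.length > 1) groups.push({ hash, count: names.length, files: names });
  }
  return groups;
}
```

Hashing content instead of comparing names means renamed copies such as `doc1_copy.pdf` still land in the same group as the original.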

📦 Batch Document Processing

Folder Scanning: Analyze entire folders recursively in one operation
Batch Organization: Rename and move multiple files automatically
Smart Folder Structure: Auto-generate organized folder hierarchies
Workflow Automation: Scan → Analyze → Preview → Organize → Undo

📄 Multi-Format Document Intelligence

Text Extraction: Extract text from PDF, DOCX, Pages, Images, TXT
Full PDF OCR: PDF.js + Tesseract.js for scanned PDFs (automatic fallback)
OCR Quality Scoring: Confidence metrics and quality assessment for all OCR operations
Multi-Encoding Support: Automatic detection and handling of UTF-8, Latin-1/ISO-8859-1
Robust Parsing: Handles null-bytes, special characters, and unusual file names
Smart Filename Suggestions: Automatically extracts:
- Scanner timestamps (preserves existing 2024-01-24_14-30-45 format)
- Document dates (DD.MM.YYYY, YYYY-MM-DD)
- Reference numbers (Invoice#, Customer#, Order#, Contract#)
- Keywords (Invoice, Contract, Company names)

🚀 Quick Start

System Requirements

For full PDF-OCR support, install these system tools:

# macOS (via Homebrew)
brew install tesseract tesseract-lang  # OCR engine with all languages
brew install poppler                    # PDF tools (pdftotext, pdftoppm)

Prerequisites

You need one of these AI desktop clients:

Perplexity Desktop App (macOS/Windows)
Claude Desktop App (macOS/Windows)

Both support the Model Context Protocol (MCP) for extending AI capabilities with custom tools.

Installation

git clone https://github.com/AndreasDietzel/mcp-document-intelligence.git
cd mcp-document-intelligence
npm install
npm run build

Configuration

Add to your MCP client configuration:

For Perplexity Desktop, edit ~/Library/Application Support/Perplexity/perplexity-config.json (macOS):

{
  "mcpServers": {
    "document-intelligence": {
      "command": "node",
      "args": ["/path/to/mcp-document-intelligence/build/index.js"],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

For Claude Desktop, edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "document-intelligence": {
      "command": "node",
      "args": ["/path/to/mcp-document-intelligence/build/index.js"],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Restart your AI client after updating the configuration.

MCP Filesystem Integration

This server works best alongside the official MCP Filesystem Server, which provides file browsing and management capabilities to your AI assistant.

Recommended Setup for Perplexity/Claude:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/your/documents"
      ]
    },
    "document-intelligence": {
      "command": "node",
      "args": ["/path/to/mcp-document-intelligence/build/index.js"]
    }
  }
}

With both servers configured, you can:

Ask your AI to find documents in your folders (filesystem server)
Analyze and organize them automatically (document-intelligence server)
Run the complete workflow through natural conversation with Perplexity/Claude

📋 Available Tools

`analyze_document`

Analyzes a single document and suggests an intelligent filename.

Input:

{
  "filePath": "/path/to/document.pdf"
}

Output Example:

{
  "originalFilename": "20240124_143045_scan.pdf",
  "suggestedFilename": "20240124_143045_RE-2024-1234_rechnung_telekom.pdf",
  "documentDate": "24.01.2024",
  "references": ["RE-2024-1234"],
  "keywords": ["rechnung", "telekom"],
  "scannerDatePreserved": true,
  "textLength": 2450,
  "preview": "Rechnung Nr. RE-2024-1234..."
}
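One way to assemble a suggested filename from the analysis fields is sketched below. It reproduces the example output above, but the assembly order, the `sanitize` rule, and the function names are assumptions, not the server's confirmed logic.

```typescript
interface Analysis {
  originalFilename: string;
  references: string[];
  keywords: string[];
}

// Leading scanner timestamp: YYYYMMDD_HHMMSS or YYYY-MM-DD_HH-MM-SS.
const SCANNER_PREFIX = /^(\d{8}_\d{6}|\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})/;

// Replaces whitespace and characters that are unsafe in file names (assumed rule).
function sanitize(part: string): string {
  return part.trim().replace(/[\\/:*?"<>|\s]+/g, "-").replace(/^-+|-+$/g, "");
}

// Keeps the scanner timestamp if present, then appends references and keywords.
function suggestFilename(a: Analysis): string {
  const dot = a.originalFilename.lastIndexOf(".");
  const ext = dot > 0 ? a.originalFilename.slice(dot) : "";
  const stem = dot > 0 ? a.originalFilename.slice(0, dot) : a.originalFilename;
  const timestamp = SCANNER_PREFIX.exec(stem)?.[1];
  const parts = [timestamp ?? stem, ...a.references, ...a.keywords.map((k) => k.toLowerCase())]
    .map(sanitize)
    .filter((p) => p.length > 0);
  return parts.join("_") + ext;
}
```

When no scanner timestamp is found, this sketch falls back to the original file stem as the first part, which is one of several reasonable choices.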

`analyze_folder` ✨ Enhanced in v4.0

Analyzes ALL documents in a folder (batch processing with recursive scanning, duplicate detection, and filtering).

Input:

{
  "folderPath": "/path/to/folder",
  "recursive": true,
  "fileTypes": ["invoice", "contract"],
  "keywords": ["telekom", "vodafone"]
}

Output:

{
  "folderPath": "/path/to/folder",
  "totalFiles": 15,
  "duplicateGroups": [
    {
      "hash": "abc123...",
      "count": 3,
      "files": ["doc1.pdf", "doc1_copy.pdf", "duplicate.pdf"]
    }
  ],
  "documents": [
    { 
      "originalPath": "...", 
      "suggestedFilename": "...", 
      "ocrQuality": "good",
      "confidence": 0.95,
      "metadata": {...} 
    }
  ]
}

`suggest_folder_structure` ✨ NEW in v3.0

Suggests intelligent folder organization based on analyzed documents.

Input:

{
  "documents": [ /* array from analyze_folder */ ]
}

`auto_organize_folder` 🤖 NEW in v4.3

Analyzes AND organizes a folder automatically - combines analysis + rename/move in one step.

Input:

{
  "sourcePath": "/path/to/source",
  "archivePath": "/path/to/archive",
  "dryRun": false,
  "createCategories": true,
  "stateFile": "/tmp/state.json"
}

Output:

{
  "total": 150,
  "processed": 147,
  "moved": 145,
  "categorized": {
    "01_Finanzen": 45,
    "03_Gesundheit": 12,
    "06_Reisen": 23,
    "99_Sonstiges": 65
  },
  "errors": [{"file": "...", "error": "..."}]
}
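The `categorized` counts could come from a keyword-to-folder lookup like the sketch below. The folder names appear in this README; the keyword lists and the first-match-wins rule are illustrative assumptions, and plain substring matching is deliberately naive.

```typescript
// Folder names appear in this README; the keyword lists are illustrative assumptions.
const CATEGORY_RULES: Array<{ folder: string; keywords: string[] }> = [
  { folder: "11_Telekommunikation", keywords: ["vodafone"] },
  { folder: "01_Finanzen", keywords: ["rechnung", "mahnung"] },
  { folder: "03_Gesundheit", keywords: ["rezept"] },
  { folder: "06_Reisen", keywords: ["hotel", "flug"] },
];
const FALLBACK_FOLDER = "99_Sonstiges";

// First matching rule wins, so more specific rules go first.
function categorize(text: string): string {
  const haystack = text.toLowerCase();
  for (const rule of CATEGORY_RULES) {
    if (rule.keywords.some((k) => haystack.includes(k))) return rule.folder;
  }
  return FALLBACK_FOLDER;
}

// Tallies categories into the shape of the `categorized` field.
function tally(texts: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const t of texts) {
    const folder = categorize(t);
    counts[folder] = (counts[folder] ?? 0) + 1;
  }
  return counts;
}
```

Ordering the rules from specific to general is what lets a Vodafone invoice land in `11_Telekommunikation` even though it also contains the word "Rechnung".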

`process_downloads` 📥 NEW in v4.3

Scans the Downloads folder and automatically files documents into the archive, detecting year and category.

Input:

{
  "downloadsPath": "~/Downloads",
  "archiveBasePath": "/path/to/archive",
  "autoMove": false,
  "maxFiles": 50
}

Output:

{
  "scanned": 23,
  "suggestions": [
    {
      "from": "~/Downloads/Rechnung.pdf",
      "to": "/archive/Zwanziger/2025/01_Finanzen/2025-03-15_Rechnung_Telekom.pdf",
      "year": 2025,
      "category": "Finanzen"
    }
  ],
  "filed": [],
  "errors": []
}

`batch_organize_large` 🧩 NEW in v4.3

Processes large folders (>100 files) in chunks with resume capability.

Input:

{
  "folderPath": "/path/to/large/folder",
  "targetArchivePath": "/path/to/archive",
  "chunkSize": 50,
  "stateFilePath": "/tmp/batch-state.json"
}

Output:

{
  "chunkCompleted": true,
  "totalFiles": 500,
  "processedFiles": 50,
  "successCount": 48,
  "errorCount": 2,
  "percentComplete": 10,
  "nextChunkExists": true,
  "stateFilePath": "/tmp/batch-state.json"
}
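The progress fields in this output follow from simple arithmetic over the persisted state. A minimal sketch, assuming a state object with the counters shown above (the `BatchState` shape and both function names are hypothetical):

```typescript
interface BatchState {
  totalFiles: number;
  processedFiles: number;
  successCount: number;
  errorCount: number;
}

interface ChunkReport extends BatchState {
  chunkCompleted: boolean;
  percentComplete: number;
  nextChunkExists: boolean;
}

// Half-open index range [start, end) of the next chunk to process.
function nextChunkRange(state: BatchState, chunkSize: number): [number, number] {
  const start = state.processedFiles;
  const end = Math.min(start + chunkSize, state.totalFiles);
  return [start, end];
}

// Folds one chunk's results into the state and derives the progress fields.
function recordChunk(state: BatchState, succeeded: number, failed: number): ChunkReport {
  const processedFiles = state.processedFiles + succeeded + failed;
  return {
    totalFiles: state.totalFiles,
    processedFiles,
    successCount: state.successCount + succeeded,
    errorCount: state.errorCount + failed,
    chunkCompleted: true,
    percentComplete:
      state.totalFiles === 0 ? 100 : Math.round((processedFiles / state.totalFiles) * 100),
    nextChunkExists: processedFiles < state.totalFiles,
  };
}
```

With 500 files and a chunk size of 50, a first chunk of 48 successes and 2 errors yields `processedFiles: 50`, `percentComplete: 10`, and `nextChunkExists: true`, matching the example output. Writing the report back to the state file after each chunk is what makes the operation resumable.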

Output of `suggest_folder_structure` (example):

{
  "structure": {
    "2024": {
      "Rechnungen": ["Telekom", "Vodafone"],
      "Vertraege": ["..."]
    }
  },
  "assignments": [
    {
      "originalPath": "/path/scan001.pdf",
      "targetFolder": "2024/Rechnungen/Telekom",
      "newFilename": "2024-01-24_RE-123_rechnung_telekom.pdf"
    }
  ]
}

`batch_organize` ✨ Enhanced in v4.0

Executes batch renaming and moving/copying of files with automatic backup.

Input:

{
  "baseFolder": "/path/to/organized",
  "mode": "move",
  "createBackup": true,
  "operations": [
    {
      "originalPath": "/path/scan001.pdf",
      "targetFolder": "2024/Rechnungen/Telekom",
      "newFilename": "2024-01-24_RE-123_rechnung_telekom.pdf"
    }
  ]
}

Output:

{
  "success": true,
  "mode": "move",
  "filesProcessed": 15,
  "filesFailed": 0,
  "foldersCreated": 5,
  "backupCreated": true,
  "backupPath": "/path/.backup_2024-01-24T10-30-00.json",
  "results": [...]
}

`preview_organization` ✨ NEW in v4.0

Shows a dry-run preview of what would happen without making changes.

Input:

{
  "baseFolder": "/path/to/organized",
  "operations": [ /* same as batch_organize */ ]
}

Output:

{
  "preview": [
    {
      "action": "move",
      "from": "/path/scan001.pdf",
      "to": "/path/organized/2024/Rechnungen/Telekom/2024-01-24_RE-123.pdf",
      "status": "ok"
    }
  ],
  "warnings": [],
  "stats": {
    "totalFiles": 15,
    "foldersToCreate": ["2024/Rechnungen/Telekom"],
    "conflicts": 0,
    "missingFiles": 0
  },
  "safeToExecute": true
}
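A dry run like this can be computed without touching the disk by planning every operation against a snapshot of known paths. The sketch below is an assumption about how such a preview might work, not the server's implementation; `previewOrganization` and the `existingPaths` parameter are hypothetical.

```typescript
interface Operation {
  originalPath: string;
  targetFolder: string;
  newFilename: string;
}

interface PreviewEntry {
  action: "move";
  from: string;
  to: string;
  status: "ok" | "conflict" | "missing";
}

// Plans the operations against a snapshot of existing paths; touches no files.
function previewOrganization(
  baseFolder: string,
  operations: Operation[],
  existingPaths: Set<string>,
) {
  const planned = new Set<string>();
  const foldersToCreate = new Set<string>();
  const preview = operations.map((op): PreviewEntry => {
    const to = [baseFolder, op.targetFolder, op.newFilename].join("/");
    let status: PreviewEntry["status"] = "ok";
    if (!existingPaths.has(op.originalPath)) status = "missing";
    else if (existingPaths.has(to) || planned.has(to)) status = "conflict";
    planned.add(to);
    if (!existingPaths.has(`${baseFolder}/${op.targetFolder}`)) {
      foldersToCreate.add(op.targetFolder);
    }
    return { action: "move", from: op.originalPath, to, status };
  });
  const conflicts = preview.filter((p) => p.status === "conflict").length;
  const missingFiles = preview.filter((p) => p.status === "missing").length;
  return {
    preview,
    stats: {
      totalFiles: operations.length,
      foldersToCreate: Array.from(foldersToCreate),
      conflicts,
      missingFiles,
    },
    safeToExecute: conflicts === 0 && missingFiles === 0,
  };
}
```

Tracking the `planned` targets catches the case where two operations in the same batch would write to the same destination, which a check against the existing files alone would miss.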

`undo_last_organization` ✨ NEW in v4.0

Restores the last organization operation from automatic backup.

Input:

{
  "baseFolder": "/path/to/organized"
}

Output:

{
  "success": true,
  "restored": 15,
  "failed": 0,
  "backupFile": "/path/.backup_2024-01-24T10-30-00.json"
}

`export_metadata` ✨ NEW in v4.0

Exports analyzed document metadata to JSON or CSV format.

Input:

{
  "documents": [ /* array from analyze_folder */ ],
  "format": "csv"
}

Output (CSV):

Filename,Path,Date,References,Keywords,OCR Quality,Confidence,Type
scan001.pdf,/path/scan001.pdf,24.01.2024,RE-2024-1234,rechnung;telekom,good,0.95,invoice
...
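Serializing documents into this CSV layout could be sketched as follows. The header row and the semicolon-joined lists match the example above; the `DocumentMeta` field names and the RFC 4180 style quoting are assumptions.

```typescript
interface DocumentMeta {
  filename: string;
  path: string;
  date: string;
  references: string[];
  keywords: string[];
  ocrQuality: string;
  confidence: number;
  type: string;
}

const CSV_HEADER = "Filename,Path,Date,References,Keywords,OCR Quality,Confidence,Type";

// Quotes a field when it contains a comma, quote, or line break (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// One header row followed by one row per document.
function toCsv(documents: DocumentMeta[]): string {
  const rows = documents.map((d) =>
    [
      d.filename,
      d.path,
      d.date,
      d.references.join(";"),
      d.keywords.join(";"),
      d.ocrQuality,
      String(d.confidence),
      d.type,
    ]
      .map(csvField)
      .join(","),
  );
  return [CSV_HEADER, ...rows].join("\n");
}
```

Quoting matters here because file names and paths can legitimately contain commas, which would otherwise shift every following column.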

🔧 Use Cases

Single Document Analysis

analyze_document with filePath: "/path/to/scanned_invoice.pdf"

→ Extracts invoice number, date, company name and suggests: 2024-01-24_INV-2024-001_rechnung_telekom.pdf

Advanced Batch Organization (v4.0)

Example conversation with Perplexity or Claude:

You: "Analyze all documents recursively in my 2026 folder, find duplicates"

AI (using analyze_folder with recursive=true):
   → Scanned 10 levels deep
   → Found 150 documents
   → Detected 12 duplicates (3 groups)
   → Extracted: dates, invoice numbers, companies
   → OCR quality: 95% confidence average

AI (using suggest_folder_structure):
   → Proposes: 2026/Rechnungen/Telekom, 2026/Vertraege/Vodafone, etc.
   → Shows: Complete list of file renames and target folders

AI (using preview_organization):
   → Preview: 150 files will be moved
   → Folders to create: 8
   → Conflicts: 0
   → Safe to execute: YES

AI: "I found 150 documents (12 duplicates). Should I organize them into 
     2026/Rechnungen, 2026/Vertraege with smart filenames?"

You: "Yes, but copy instead of moving"

AI (using batch_organize with mode="copy", createBackup=true):
   → Copies all files (originals preserved)
   → Creates folder structure
   → Backup created for undo
   → Processes everything automatically

AI: "Done! Organized 150 documents, created 8 folders, backup saved.
     150 files processed, 0 failed. Type 'undo' to revert."

You: "Export the metadata as CSV"

AI (using export_metadata with format="csv"):
   → Exports all document metadata
   → Includes: filename, date, references, keywords, OCR quality

AI: "CSV exported with all metadata for 150 documents."

Complete Workflow v4.0:

Scanner saves to "Inbox" folder
Tell Perplexity/Claude to analyze recursively + find duplicates
Preview changes before execution
Confirm with copy or move mode
Files auto-organized with automatic backup
Export metadata for records
Undo anytime if needed

Workflow Automation

Before: Manual sorting of 100+ scanned documents
After: One command → Preview → Organization in seconds
Safety: Automatic backups, preview mode, undo function
Perfect for: Tax documents, invoices, contracts, receipts, archives

🛠️ Technical Details

Dependencies

pdf-parse: PDF text extraction
mammoth: DOCX document processing
adm-zip: Pages document extraction
tesseract.js: OCR for scanned documents and images
@modelcontextprotocol/sdk: MCP protocol implementation

File Structure

mcp-document-intelligence/
├── src/
│   └── index.ts          # Main server implementation
├── build/                # Compiled output
├── package.json
├── tsconfig.json
└── README.md

Filename Pattern Recognition

The analyzer recognizes:

Scanner timestamps: YYYY-MM-DD_HH-MM-SS or YYYYMMDD_HHMMSS
Document dates: DD.MM.YYYY, YYYY-MM-DD
Reference patterns:
- Rechnungs-Nr: XXX / Invoice: XXX
- Kunden-Nr: XXX / Customer: XXX
- Bestell-Nr: XXX / Order: XXX
- Vertrag-Nr: XXX / Contract: XXX
Keywords: Invoice, Contract, Offer, Order, common company names
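The patterns above translate fairly directly into regular expressions. The following is a sketch, not the server's actual patterns: the exact expressions, the requirement that a reference contain at least one digit, and the `extractPatterns` helper are all assumptions.

```typescript
// Scanner timestamps: YYYY-MM-DD_HH-MM-SS or YYYYMMDD_HHMMSS, not embedded in longer digit runs.
const SCANNER_TIMESTAMP = /(?:^|\D)(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}|\d{8}_\d{6})(?!\d)/;
// Document dates: DD.MM.YYYY and YYYY-MM-DD.
const DATE_DMY = /(?:^|\D)(\d{2}\.\d{2}\.\d{4})(?!\d)/;
const DATE_ISO = /(?:^|\D)(\d{4}-\d{2}-\d{2})(?!\d)/;
// A label from the list above, an optional ":" or "#", then an identifier containing a digit.
const REFERENCE_SOURCE =
  "(?:Rechnungs-Nr|Kunden-Nr|Bestell-Nr|Vertrag-Nr|Invoice|Customer|Order|Contract)" +
  "\\.?\\s*[:#]?\\s*([A-Z0-9-]*\\d[A-Z0-9-]*)";

interface Extracted {
  scannerTimestamp?: string;
  documentDate?: string;
  references: string[];
}

// Reads the timestamp from the file name and dates/references from the text.
function extractPatterns(filename: string, text: string): Extracted {
  const references: string[] = [];
  const referenceRe = new RegExp(REFERENCE_SOURCE, "gi"); // fresh instance, so lastIndex starts at 0
  let match: RegExpExecArray | null;
  while ((match = referenceRe.exec(text)) !== null) {
    references.push(match[1]);
  }
  return {
    scannerTimestamp: SCANNER_TIMESTAMP.exec(filename)?.[1],
    documentDate: (DATE_DMY.exec(text) ?? DATE_ISO.exec(text))?.[1],
    references,
  };
}
```

Requiring a digit in the captured identifier is a cheap guard against matching ordinary prose such as "Invoice total".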

Folder Structure Generation

Automatically groups documents by:

Year: Extracted from document date
Category: Rechnungen, Verträge, Angebote, Mahnungen, etc.
Company: Telekom, Vodafone, Amazon, PayPal, Banks, etc.
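The grouping above yields target folders such as `2024/Rechnungen/Telekom`. A minimal sketch of that path construction follows; `yearOf` and `targetFolder` are hypothetical names, and skipping unknown parts is an assumed behaviour.

```typescript
// Pulls the four-digit year out of DD.MM.YYYY or YYYY-MM-DD.
function yearOf(documentDate: string): string | undefined {
  const dmy = /^\d{2}\.\d{2}\.(\d{4})$/.exec(documentDate);
  if (dmy) return dmy[1];
  const iso = /^(\d{4})-\d{2}-\d{2}$/.exec(documentDate);
  return iso ? iso[1] : undefined;
}

// Joins year, category, and company, skipping any part that is unknown.
function targetFolder(documentDate: string, category?: string, company?: string): string {
  return [yearOf(documentDate), category, company]
    .filter((part): part is string => typeof part === "string" && part.length > 0)
    .join("/");
}
```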

🔒 Privacy & Security

✅ All data stays local - No external API calls for personal data
✅ OCR processing on-device - Tesseract.js runs locally
✅ No data transmission - All processing happens locally
✅ No logging of document content

🌍 Encoding & International Support

✅ Automatic Encoding Detection - UTF-8 and Latin-1/ISO-8859-1
✅ International Characters - Full Unicode support (日本語, 中文, العربية, עברית)
✅ German Umlauts - Native support for äöüÄÖÜß
✅ Special Characters - Handles pipes, colons, quotes in filenames
✅ Null-Byte Handling - Automatically cleans corrupted files
✅ Encoding Info - Reports detected encoding in analysis results
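Encoding detection with a Latin-1 fallback and null-byte cleanup might look like the sketch below. It relies on the standard `TextDecoder` with `fatal: true`, which throws on invalid UTF-8; `decodeText` and the `Decoded` shape are hypothetical, and the server's real detection may use a different heuristic.

```typescript
interface Decoded {
  text: string;
  encoding: "utf-8" | "latin1";
}

// Strict UTF-8 first; on invalid byte sequences, fall back to Latin-1.
function decodeText(bytes: Uint8Array): Decoded {
  let text: string;
  let encoding: Decoded["encoding"];
  try {
    text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    encoding = "utf-8";
  } catch {
    // ISO-8859-1 maps each byte directly to the code point of the same value.
    text = Array.from(bytes, (b) => String.fromCharCode(b)).join("");
    encoding = "latin1";
  }
  // Strip null bytes that corrupted or oddly exported files can contain.
  return { text: text.replace(/\u0000/g, ""), encoding };
}
```

Trying strict UTF-8 first works because text containing Latin-1 umlauts is almost never valid UTF-8, so the failed decode doubles as the detection step.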

🧪 Quality Assurance

This project maintains high quality standards with comprehensive testing:

100 Automated Tests covering all functionality
99% Test Pass Rate (99/100 tests passing)
ISO 25010 Compliant - Quality characteristics validated
Performance Benchmarks - <100ms per file average
Security Tested - Path traversal and data privacy verified

See TEST-DOCUMENTATION.md for detailed test coverage and results.

🚀 Roadmap

[x] Multi-format document support (PDF, DOCX, Pages, Images, TXT)
[x] Batch processing support
[x] Auto-filing to folders based on content
[x] Recursive folder scanning
[x] Duplicate detection with SHA256
[x] Preview mode (dry-run)
[x] Backup & Undo functionality
[x] Metadata export (JSON/CSV)
[x] Copy vs Move modes
[x] OCR quality feedback
[ ] Configurable naming templates
[ ] Custom reference number patterns
[ ] Excel/CSV document support
[ ] Integration with document management systems
[ ] Machine learning for improved categorization

📄 License

MIT License - see LICENSE file

🙏 Acknowledgments

Built with Model Context Protocol SDK
Inspired by and based on concepts from MCP Filesystem Server
PDF parsing by pdf-parse
OCR by Tesseract.js

Made for intelligent document workflows 📄✨

