mcp-document-intelligence

Enables AI assistants to analyze, rename, categorize, and organize local documents using OCR, metadata extraction, and batch processing, with preview mode and undo support.

+--------------------------------------------------------------------------------+
|                                                                                |
|  __  __  ____ ____        ____                                _                |
| |  \/  |/ ___|  _ \      |  _ \  ___   ___ _   _ _ __ ___   | |_ ___          |
| | |\/| | |   | |_) |_____| | | |/ _ \ / __| | | | '_ ` _ \  | __/ _ \         |
| | |  | | |___|  __/_____| |_| | (_) | (__| |_| | | | | | | | ||  __/         |
| |_|  |_|\____|_|        |____/ \___/ \___|\__,_|_| |_| |_|  \__\___|         |
|                                                                                |
|      ____       _       _ _ _                       _                          |
|     |_ _|_ __  | |_ ___| | (_) __ _  ___ _ __   ___| |                         |
|      | || '_ \ | __/ _ \ | | |/ _` |/ _ \ '_ \ / __| |                         |
|      | || | | || ||  __/ | | | (_| |  __/ | | | (__| |                         |
|     |___|_| |_| \__\___|_|_|_|\__, |\___|_| |_|\___|_|                         |
|                                |___/                                           |
|                                                                                |
|              MCP SERVER  |  OCR  |  ANALYZE  |  ORGANIZE                       |
|                                                                                |
+--------------------------------------------------------------------------------+

MCP Document Intelligence Server

Model Context Protocol Server with Advanced Batch Processing & Intelligent Document Organization

Built for Perplexity Desktop, Claude Desktop, and other MCP-compatible clients. Supercharge your AI assistant with enterprise-grade document intelligence.

This MCP server analyzes, renames, categorizes, and organizes documents through natural-language requests in your AI client. It scans folders recursively, detects duplicates, extracts metadata from PDFs, DOCX, Pages, images, and text files, and uses OCR for scanned documents. Preview mode, backups, undo support, metadata export, and memory-optimized batch processing make it practical for large personal archives. The result is an AI-driven document workflow that stays local, fast, and automatable.

Features

NEW in v4.6 - Automatic OCR Integration

  • Auto-OCR for Scanned PDFs: Automatically falls back to Tesseract OCR when text extraction yields < 50 characters
  • Image Support: Process .jpg, .jpeg, .png files with OCR in all tools
  • German Language: Pre-configured with German language support (-l deu)
  • Enhanced Entity Detection: Vodafone, Telekom, O2, DHL, Amazon now recognized
  • Fixed Categorization: Vodafone → 11_Telekommunikation (not insurance!)
  • Graceful Fallback: pdftotext → OCR → empty string (30s timeout per file)
  • More Document Types: Added Rezept, Kündigung, Mahnung patterns
  • Full Archive Scan: New script analyzed 693 files, improved 559 with OCR
  • Documentation: Complete OCR-INTEGRATION.md guide
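
The fallback chain in the v4.6 list (text extraction, then OCR, then an empty string) can be sketched as follows. This is a minimal illustration, not the server's actual code: the `Extractor` type, `withTimeout`, and `extractWithFallback` are hypothetical names, and the two extractors are injected so the sketch does not assume how pdftotext or Tesseract are invoked. Only the 50-character threshold and the 30-second timeout come from the list above.

```typescript
// Hypothetical extractor signature. Real implementations would call
// pdftotext and Tesseract; that part is deliberately not shown.
type Extractor = (filePath: string) => Promise<string>;

const MIN_TEXT_LENGTH = 50;    // threshold named in the feature list
const OCR_TIMEOUT_MS = 30_000; // per-file timeout named in the feature list

// Rejects if `promise` does not settle within `ms` milliseconds.
// The underlying work is not cancelled, only ignored.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Text extraction -> OCR -> empty string.
async function extractWithFallback(
  filePath: string,
  extractText: Extractor,
  runOcr: Extractor,
  timeoutMs: number = OCR_TIMEOUT_MS,
): Promise<string> {
  let text = "";
  try {
    text = (await extractText(filePath)).trim();
  } catch {
    text = ""; // treat a failed extraction like an empty text layer
  }
  if (text.length >= MIN_TEXT_LENGTH) return text;
  try {
    return (await withTimeout(runOcr(filePath), timeoutMs)).trim();
  } catch {
    return ""; // OCR failed or timed out
  }
}
```

Injecting the extractors keeps the decision logic testable without Tesseract installed.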

v4.5 - Advanced Archive Management

  • cleanup_old_structure: Removes old folder hierarchies, consolidates into standard categories
  • optimize_folder_structure: Deletes empty folders, moves single-file categories to 99_Sonstiges
  • intelligent_rename: PDF content analysis for smart naming (extracts companies, document types)
  • move_loose_files: Pattern-based categorization for loose files
  • Production Ready: Tested with 2,714 files, fully automated workflow
  • Complete Documentation: PRODUCTION-SETUP.md and TESTFALL-PERPLEXITY.md

v4.4 - Performance Optimizations

  • Memory-Efficient Processing: Generator-based file scanning - no memory overflow
  • Batch Processing: Processes 25 files per batch with automatic pauses
  • Safety Limits: Configurable limits (500 files/year) prevent system crashes
  • Garbage Collection: Explicit memory cleanup between batches
  • Progressive Processing: Resume-friendly architecture with state tracking
  • Reduced Memory Footprint: 90% reduction vs previous versions
v4.3 - Autonomous Organization

  • auto_organize_folder: Analyzes AND organizes folders automatically
  • process_downloads: Auto-files Downloads into archive with category detection
  • batch_organize_large: Processes >100 files in chunks with resume capability
  • Smart Categorization: Auto-detects 10+ categories (Finanzen, Gesundheit, Reisen, etc.)
  • State Persistence: Resume interrupted operations from JSON state files
  • Decade Detection: Automatically routes to Achziger/Neunziger/Nuller/Zehner/Zwanziger
v4.2 - Full PDF OCR Support

  • Scanned PDF Intelligence: Complete OCR solution for image-based PDFs
  • Automatic Detection: Smart fallback from text extraction to OCR (<50 chars triggers OCR)
  • Quality Metrics: OCR confidence scores and quality assessment
  • Optimized Processing: PDF.js rendering + Tesseract OCR (up to 5 pages)
  • German Language Model: Pre-configured for local documents
  • Apache-2.0 Licensed Rendering: PDF.js (Mozilla) is Apache-2.0 licensed, so using it raises no licensing issues

v4.1 - Quality & Performance Enhancements

  • Comprehensive Testing: 100 automated test cases with 99% pass rate
  • Performance Metrics: Real-time processing stats and throughput reporting
  • Enhanced Validation: File size, name length, and type validation
  • Better Encoding Detection: Automatic UTF-8/Latin-1 switching with reporting
  • Optimized Processing: Average <100ms per file, batch <2000ms
  • Robust Error Handling: Structured errors with actionable suggestions

v4.0 - Enterprise Features

  • Recursive Scanning: Deep folder analysis up to 10 levels
  • Duplicate Detection: SHA256-based file deduplication
  • Preview Mode: Dry-run operations before execution
  • Backup & Undo: Automatic backups with one-click restore
  • Metadata Export: Export analysis results to JSON/CSV
  • Smart Filters: Filter by file type and keywords
  • Copy Mode: Copy files instead of moving them
  • Configurable Rules: Custom folder organization patterns
  • OCR Quality Feedback: Confidence scores for scanned documents
  • Detailed Statistics: Comprehensive operation summaries
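
SHA256-based duplicate detection amounts to grouping files by content hash. The sketch below shows the idea on in-memory byte arrays. The `DuplicateGroup` shape mirrors the `duplicateGroups` entries in the analyze_folder output later in this README, while `sha256` and `findDuplicates` are hypothetical names and not the server's API.

```typescript
import { createHash } from "node:crypto";

interface DuplicateGroup {
  hash: string;
  count: number;
  files: string[];
}

// Hex-encoded SHA-256 digest of a file's bytes.
function sha256(content: Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Groups file names by content hash and keeps groups with more than one member.
function findDuplicates(files: Map<string, Uint8Array>): DuplicateGroup[] {
  const byHash = new Map<string, string[]>();
  files.forEach((content, name) => {
    const hash = sha256(content);
    const names = byHash.get(hash) ?? [];
    names.push(name);
    byHash.set(hash, names);
  });
  const groups: DuplicateGroup[] = [];
  byHash.forEach((names, hash) => {
    if (names.length > 1) groups.push({ hash, count: names.length, files: names });
  });
  return groups;
}
```

For large archives, hashing a stream per file instead of loading whole files would keep memory use flat; that variant is not shown here.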

Batch Document Processing

  • Folder Scanning: Analyze entire folders recursively in one operation
  • Batch Organization: Rename and move multiple files automatically
  • Smart Folder Structure: Auto-generate organized folder hierarchies
  • Workflow Automation: Scan → Analyze → Preview → Organize → Undo

Multi-Format Document Intelligence

  • Text Extraction: Extract text from PDF, DOCX, Pages, Images, TXT
  • Full PDF OCR: PDF.js + Tesseract.js for scanned PDFs (automatic fallback)
  • OCR Quality Scoring: Confidence metrics and quality assessment for all OCR operations
  • Multi-Encoding Support: Automatic detection and handling of UTF-8, Latin-1/ISO-8859-1
  • Robust Parsing: Handles null-bytes, special characters, and unusual file names
  • Smart Filename Suggestions: Automatically extracts:
    • Scanner timestamps (preserves existing 2024-01-24_14-30-45 format)
    • Document dates (DD.MM.YYYY, YYYY-MM-DD)
    • Reference numbers (Invoice#, Customer#, Order#, Contract#)
    • Keywords (Invoice, Contract, Company names)

Quick Start

System Requirements

For full PDF-OCR support, install these system tools:

# macOS (via Homebrew)
brew install tesseract tesseract-lang  # OCR engine with all languages
brew install poppler                    # PDF rendering tools (pdftoppm)

Prerequisites

You need one of these AI desktop clients:

  • Perplexity Desktop
  • Claude Desktop

Both support the Model Context Protocol (MCP) for extending AI capabilities with custom tools.

Installation

git clone https://github.com/AndreasDietzel/mcp-document-intelligence.git
cd mcp-document-intelligence
npm install
npm run build

Configuration

Add to your MCP client configuration:

For Perplexity Desktop, edit ~/Library/Application Support/Perplexity/perplexity-config.json (macOS):

{
  "mcpServers": {
    "document-intelligence": {
      "command": "node",
      "args": ["/path/to/mcp-document-intelligence/build/index.js"],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

For Claude Desktop, edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "document-intelligence": {
      "command": "node",
      "args": ["/path/to/mcp-document-intelligence/build/index.js"],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Restart your AI client after updating the configuration.

MCP Filesystem Integration

This server works best alongside the official MCP Filesystem Server, which provides file browsing and management capabilities to your AI assistant.

Recommended Setup for Perplexity/Claude:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/your/documents"
      ]
    },
    "document-intelligence": {
      "command": "node",
      "args": ["/path/to/mcp-document-intelligence/build/index.js"]
    }
  }
}

With both servers configured, you can:

  1. Ask your AI to find documents in your folders (filesystem server)
  2. Analyze and organize them automatically (document-intelligence server)
  3. Handle the complete workflow through natural conversation with Perplexity or Claude

Available Tools

analyze_document

Analyzes a single document and suggests an intelligent filename.

Input:

{
  "filePath": "/path/to/document.pdf"
}

Output Example:

{
  "originalFilename": "20240124_143045_scan.pdf",
  "suggestedFilename": "20240124_143045_RE-2024-1234_rechnung_telekom.pdf",
  "documentDate": "24.01.2024",
  "references": ["RE-2024-1234"],
  "keywords": ["rechnung", "telekom"],
  "scannerDatePreserved": true,
  "textLength": 2450,
  "preview": "Rechnung Nr. RE-2024-1234..."
}
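
How a suggested filename might be assembled from the extracted parts is sketched below. The ordering (timestamp or date, first reference, keywords, original extension) is inferred from the example output above and is an assumption. `FilenameParts`, `toIsoDate`, and `suggestFilename` are hypothetical names.

```typescript
import { extname } from "node:path";

interface FilenameParts {
  originalFilename: string;
  scannerTimestamp?: string; // e.g. "20240124_143045"
  documentDate?: string;     // e.g. "24.01.2024"
  references: string[];
  keywords: string[];
}

// Converts DD.MM.YYYY to YYYY-MM-DD; returns undefined for anything else.
function toIsoDate(date: string | undefined): string | undefined {
  const m = date ? /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(date) : null;
  return m ? `${m[3]}-${m[2]}-${m[1]}` : undefined;
}

// Prefers an existing scanner timestamp over the document date as prefix.
function suggestFilename(parts: FilenameParts): string {
  const prefix = parts.scannerTimestamp ?? toIsoDate(parts.documentDate);
  const segments = [prefix, parts.references[0], ...parts.keywords]
    .filter((s): s is string => typeof s === "string" && s.length > 0)
    // Replace characters that are unsafe in filenames, plus whitespace.
    .map((s) => s.replace(/[\\/:*?"<>|\s]+/g, "-"));
  if (segments.length === 0) return parts.originalFilename; // nothing extracted
  return segments.join("_") + extname(parts.originalFilename).toLowerCase();
}
```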

analyze_folder (Enhanced in v4.0)

Analyzes ALL documents in a folder (batch processing with recursive scanning, duplicate detection, and filtering).

Input:

{
  "folderPath": "/path/to/folder",
  "recursive": true,
  "fileTypes": ["invoice", "contract"],
  "keywords": ["telekom", "vodafone"]
}

Output:

{
  "folderPath": "/path/to/folder",
  "totalFiles": 15,
  "duplicateGroups": [
    {
      "hash": "abc123...",
      "count": 3,
      "files": ["doc1.pdf", "doc1_copy.pdf", "duplicate.pdf"]
    }
  ],
  "documents": [
    { 
      "originalPath": "...", 
      "suggestedFilename": "...", 
      "ocrQuality": "good",
      "confidence": 0.95,
      "metadata": {...} 
    }
  ]
}

suggest_folder_structure (New in v3.0)

Suggests intelligent folder organization based on analyzed documents.

Input:

{
  "documents": [ /* array from analyze_folder */ ]
}

Output:

{
  "structure": {
    "2024": {
      "Rechnungen": ["Telekom", "Vodafone"],
      "Vertraege": ["..."]
    }
  },
  "assignments": [
    {
      "originalPath": "/path/scan001.pdf",
      "targetFolder": "2024/Rechnungen/Telekom",
      "newFilename": "2024-01-24_RE-123_rechnung_telekom.pdf"
    }
  ]
}

auto_organize_folder (New in v4.3)

Analyzes and organizes a folder automatically, combining analysis and rename/move in one step.

Input:

{
  "sourcePath": "/path/to/source",
  "archivePath": "/path/to/archive",
  "dryRun": false,
  "createCategories": true,
  "stateFile": "/tmp/state.json"
}

Output:

{
  "total": 150,
  "processed": 147,
  "moved": 145,
  "categorized": {
    "01_Finanzen": 45,
    "03_Gesundheit": 12,
    "06_Reisen": 23,
    "99_Sonstiges": 65
  },
  "errors": [{"file": "...", "error": "..."}]
}

process_downloads (New in v4.3)

Scans the Downloads folder and files documents into the archive with year and category detection.

Input:

{
  "downloadsPath": "~/Downloads",
  "archiveBasePath": "/path/to/archive",
  "autoMove": false,
  "maxFiles": 50
}

Output:

{
  "scanned": 23,
  "suggestions": [
    {
      "from": "~/Downloads/Rechnung.pdf",
      "to": "/archive/Zwanziger/2025/01_Finanzen/2025-03-15_Rechnung_Telekom.pdf",
      "year": 2025,
      "category": "Finanzen"
    }
  ],
  "filed": [],
  "errors": []
}

batch_organize_large (New in v4.3)

Processes large folders (>100 files) in chunks with resume capability.

Input:

{
  "folderPath": "/path/to/large/folder",
  "targetArchivePath": "/path/to/archive",
  "chunkSize": 50,
  "stateFilePath": "/tmp/batch-state.json"
}

Output:

{
  "chunkCompleted": true,
  "totalFiles": 500,
  "processedFiles": 50,
  "successCount": 48,
  "errorCount": 2,
  "percentComplete": 10,
  "nextChunkExists": true,
  "stateFilePath": "/tmp/batch-state.json"
}
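
The chunk-and-resume behavior of batch_organize_large can be modeled as a pure function over a persisted counter. This is a sketch under assumptions: the shape of the state file is not documented here, so `BatchState` and `nextChunk` are hypothetical, and only the report field names are taken from the batch_organize_large output example in this README.

```typescript
// Assumed minimal state; the real JSON state file may hold more fields.
interface BatchState {
  processedFiles: number;
}

interface ChunkReport {
  chunkCompleted: boolean;
  totalFiles: number;
  processedFiles: number;
  percentComplete: number;
  nextChunkExists: boolean;
}

// Selects the next chunk after the already processed files and returns
// the updated state together with a progress report.
function nextChunk<T>(
  files: T[],
  state: BatchState,
  chunkSize: number,
): { chunk: T[]; state: BatchState; report: ChunkReport } {
  const start = state.processedFiles;
  const chunk = files.slice(start, start + chunkSize);
  const processedFiles = start + chunk.length;
  const totalFiles = files.length;
  return {
    chunk,
    state: { processedFiles },
    report: {
      chunkCompleted: true,
      totalFiles,
      processedFiles,
      percentComplete: totalFiles === 0 ? 100 : Math.round((processedFiles / totalFiles) * 100),
      nextChunkExists: processedFiles < totalFiles,
    },
  };
}
```

Writing the returned state to disk with `JSON.stringify` after each chunk is what would let an interrupted run continue where it stopped.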

batch_organize (Enhanced in v4.0)

Executes batch renaming and moving/copying of files with automatic backup.

Input:

{
  "baseFolder": "/path/to/organized",
  "mode": "move",
  "createBackup": true,
  "operations": [
    {
      "originalPath": "/path/scan001.pdf",
      "targetFolder": "2024/Rechnungen/Telekom",
      "newFilename": "2024-01-24_RE-123_rechnung_telekom.pdf"
    }
  ]
}

Output:

{
  "success": true,
  "mode": "move",
  "filesProcessed": 15,
  "filesFailed": 0,
  "foldersCreated": 5,
  "backupCreated": true,
  "backupPath": "/path/.backup_2024-01-24T10-30-00.json",
  "results": [...]
}

preview_organization (New in v4.0)

Shows a dry-run preview of what would happen without making changes.

Input:

{
  "baseFolder": "/path/to/organized",
  "operations": [ /* same as batch_organize */ ]
}

Output:

{
  "preview": [
    {
      "action": "move",
      "from": "/path/scan001.pdf",
      "to": "/path/to/organized/2024/Rechnungen/Telekom/2024-01-24_RE-123_rechnung_telekom.pdf",
      "status": "ok"
    }
  ],
  "warnings": [],
  "stats": {
    "totalFiles": 15,
    "foldersToCreate": ["2024/Rechnungen/Telekom"],
    "conflicts": 0,
    "missingFiles": 0
  },
  "safeToExecute": true
}
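
A dry run of this kind can be computed without touching the file system. The sketch below resolves each operation to its target path and flags operations that collide with each other. It is illustrative only: `previewOperations` is a hypothetical name, and unlike a real preview it does not check for files that already exist on disk or for missing sources.

```typescript
import { posix } from "node:path";

interface Operation {
  originalPath: string;
  targetFolder: string;
  newFilename: string;
}

interface PreviewEntry {
  action: "move" | "copy";
  from: string;
  to: string;
  status: "ok" | "conflict";
}

// Resolves target paths and marks operations whose targets collide.
function previewOperations(
  baseFolder: string,
  operations: Operation[],
  action: "move" | "copy" = "move",
) {
  const seenTargets = new Set<string>();
  const folders = new Set<string>();
  const preview: PreviewEntry[] = [];
  let conflicts = 0;
  for (const op of operations) {
    const to = posix.join(baseFolder, op.targetFolder, op.newFilename);
    const isConflict = seenTargets.has(to);
    if (isConflict) conflicts += 1;
    seenTargets.add(to);
    folders.add(op.targetFolder);
    preview.push({ action, from: op.originalPath, to, status: isConflict ? "conflict" : "ok" });
  }
  return {
    preview,
    stats: {
      totalFiles: operations.length,
      foldersToCreate: Array.from(folders),
      conflicts,
    },
    safeToExecute: conflicts === 0,
  };
}
```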

undo_last_organization (New in v4.0)

Restores the last organization operation from automatic backup.

Input:

{
  "baseFolder": "/path/to/organized"
}

Output:

{
  "success": true,
  "restored": 15,
  "failed": 0,
  "backupFile": "/path/.backup_2024-01-24T10-30-00.json"
}
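
Conceptually, an undo replays the backup in reverse. The format of the `.backup_*.json` file is not documented in this README, so the sketch below assumes a simple list of `from`/`to` pairs. Both `BackupEntry` and `planUndo` are hypothetical.

```typescript
// Assumed backup record: where a file came from and where it was moved to.
interface BackupEntry {
  from: string;
  to: string;
}

// Produces the moves needed to restore the original layout:
// last operation first, with source and destination swapped.
function planUndo(entries: BackupEntry[]): BackupEntry[] {
  return entries
    .slice()
    .reverse()
    .map((entry) => ({ from: entry.to, to: entry.from }));
}
```

Reversing the order matters when later operations depend on earlier ones, for example when a folder was created by the first move.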

export_metadata (New in v4.0)

Exports analyzed document metadata to JSON or CSV format.

Input:

{
  "documents": [ /* array from analyze_folder */ ],
  "format": "csv"
}

Output (CSV):

Filename,Path,Date,References,Keywords,OCR Quality,Confidence,Type
scan001.pdf,/path/scan001.pdf,24.01.2024,RE-2024-1234,rechnung;telekom,good,0.95,invoice
...
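
A CSV export with this header could be produced as sketched below. The column order comes from the example above. The `DocumentMeta` field names, the use of `;` to join multi-valued fields, and the quoting rules are assumptions (the quoting follows common CSV practice: fields containing commas, quotes, or line breaks are wrapped in double quotes).

```typescript
interface DocumentMeta {
  filename: string;
  path: string;
  date: string;
  references: string[];
  keywords: string[];
  ocrQuality: string;
  confidence: number;
  type: string;
}

// Quotes a field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(docs: DocumentMeta[]): string {
  const header = "Filename,Path,Date,References,Keywords,OCR Quality,Confidence,Type";
  const rows = docs.map((d) =>
    [
      d.filename,
      d.path,
      d.date,
      d.references.join(";"),
      d.keywords.join(";"),
      d.ocrQuality,
      String(d.confidence),
      d.type,
    ].map(csvField).join(","),
  );
  return [header, ...rows].join("\n");
}
```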

Use Cases

Single Document Analysis

analyze_document with filePath: "/path/to/scanned_invoice.pdf"

→ Extracts invoice number, date, company name and suggests: 2024-01-24_INV-2024-001_rechnung_telekom.pdf

Advanced Batch Organization (v4.0)

Example conversation with Perplexity or Claude:

You: "Analyze all documents recursively in my 2026 folder, find duplicates"

AI (using analyze_folder with recursive=true):
   → Scanned 10 levels deep
   → Found 150 documents
   → Detected 12 duplicates (3 groups)
   → Extracted: dates, invoice numbers, companies
   → OCR quality: 95% confidence average

AI (using suggest_folder_structure):
   → Proposes: 2026/Rechnungen/Telekom, 2026/Vertraege/Vodafone, etc.
   → Shows: Complete list of file renames and target folders

AI (using preview_organization):
   → Preview: 150 files will be moved
   → Folders to create: 8
   → Conflicts: 0
   → Safe to execute: YES

AI: "I found 150 documents (12 duplicates). Should I organize them into 
     2026/Rechnungen, 2026/Vertraege with smart filenames?"

You: "Yes, but copy instead of moving"

AI (using batch_organize with mode="copy", createBackup=true):
   → Copies all files (originals preserved)
   → Creates folder structure
   → Backup created for undo
   → Processes everything automatically

AI: "Done! Organized 150 documents, created 8 folders, backup saved.
     150 files processed, 0 failed. Type 'undo' to revert."

You: "Export the metadata as CSV"

AI (using export_metadata with format="csv"):
   → Exports all document metadata
   → Includes: filename, date, references, keywords, OCR quality

AI: "CSV exported with all metadata for 150 documents."

Complete Workflow v4.0:

  1. Scanner saves to "Inbox" folder
  2. Tell Perplexity/Claude to analyze recursively + find duplicates
  3. Preview changes before execution
  4. Confirm with copy or move mode
  5. Files auto-organized with automatic backup
  6. Export metadata for records
  7. Undo anytime if needed

Workflow Automation

  • Before: Manual sorting of 100+ scanned documents
  • After: One command → Preview → Organization in seconds
  • Safety: Automatic backups, preview mode, undo function
  • Perfect for: Tax documents, invoices, contracts, receipts, archives

Technical Details

Dependencies

  • pdf-parse: PDF text extraction
  • mammoth: DOCX document processing
  • adm-zip: Pages document extraction
  • tesseract.js: OCR for scanned documents and images
  • @modelcontextprotocol/sdk: MCP protocol implementation

File Structure

mcp-document-intelligence/
├── src/
│   └── index.ts          # Main server implementation
├── build/                # Compiled output
├── package.json
├── tsconfig.json
└── README.md

Filename Pattern Recognition

The analyzer recognizes:

  • Scanner timestamps: YYYY-MM-DD_HH-MM-SS or YYYYMMDD_HHMMSS
  • Document dates: DD.MM.YYYY, YYYY-MM-DD
  • Reference patterns:
    • Rechnungs-Nr: XXX / Invoice: XXX
    • Kunden-Nr: XXX / Customer: XXX
    • Bestell-Nr: XXX / Order: XXX
    • Vertrag-Nr: XXX / Contract: XXX
  • Keywords: Invoice, Contract, Offer, Order, common company names
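
The patterns listed above translate naturally into regular expressions. The expressions below are illustrative guesses at what such patterns could look like, not the ones used in src/index.ts. `recognizePatterns` and the constant names are hypothetical, and only the labelled `Label: value` reference form is covered.

```typescript
// Scanner timestamp at the start of a filename, in either documented format.
const SCANNER_TIMESTAMP = /^(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}|\d{8}_\d{6})/;
// Document dates: DD.MM.YYYY and YYYY-MM-DD.
const DOTTED_DATE = /\b\d{2}\.\d{2}\.\d{4}\b/g;
const ISO_DATE = /\b\d{4}-\d{2}-\d{2}\b/g;
// Labelled reference numbers such as "Rechnungs-Nr: RE-2024-1234".
const REFERENCE =
  /\b(Rechnungs-Nr|Invoice|Kunden-Nr|Customer|Bestell-Nr|Order|Vertrag-Nr|Contract)\.?:\s*([A-Za-z0-9][A-Za-z0-9-]*)/g;

interface RecognizedPatterns {
  scannerTimestamp: string | undefined;
  dates: string[];
  references: string[];
}

function recognizePatterns(filename: string, text: string): RecognizedPatterns {
  const timestamp = SCANNER_TIMESTAMP.exec(filename);
  const dates = [...(text.match(DOTTED_DATE) ?? []), ...(text.match(ISO_DATE) ?? [])];
  // Capture group 2 holds the reference value after the label.
  const references = Array.from(text.matchAll(REFERENCE), (m) => m[2]);
  return { scannerTimestamp: timestamp ? timestamp[1] : undefined, dates, references };
}
```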

Folder Structure Generation

Automatically groups documents by:

  • Year: Extracted from document date
  • Category: Rechnungen, Verträge, Angebote, Mahnungen, etc.
  • Company: Telekom, Vodafone, Amazon, PayPal, Banks, etc.
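
Grouping by year, category, and company can be expressed as building a relative target folder per document. The sketch assumes each analyzed document already carries a category and an optional company. `AnalyzedDoc`, `yearOf`, `assignFolder`, and the "Undated" fallback folder are hypothetical, while the assignment shape follows the suggest_folder_structure example in this README.

```typescript
interface AnalyzedDoc {
  originalPath: string;
  documentDate: string; // DD.MM.YYYY or YYYY-MM-DD
  category: string;     // e.g. "Rechnungen"
  company?: string;     // e.g. "Telekom"
  suggestedFilename: string;
}

interface Assignment {
  originalPath: string;
  targetFolder: string;
  newFilename: string;
}

// Extracts the four-digit year from either supported date format.
function yearOf(documentDate: string): string | undefined {
  const dotted = /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(documentDate);
  if (dotted) return dotted[3];
  const iso = /^(\d{4})-(\d{2})-(\d{2})$/.exec(documentDate);
  return iso ? iso[1] : undefined;
}

// Builds <year>/<category>/<company>, omitting the company when unknown.
function assignFolder(doc: AnalyzedDoc): Assignment {
  const segments = [yearOf(doc.documentDate) ?? "Undated", doc.category, doc.company]
    .filter((s): s is string => typeof s === "string" && s.length > 0);
  return {
    originalPath: doc.originalPath,
    targetFolder: segments.join("/"),
    newFilename: doc.suggestedFilename,
  };
}
```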

Privacy & Security

  • All data stays local - No external API calls for personal data
  • OCR processing on-device - Tesseract.js runs locally
  • No data transmission - All processing happens locally
  • No logging of document content

Encoding & International Support

  • Automatic Encoding Detection - UTF-8 and Latin-1/ISO-8859-1
  • International Characters - Full Unicode support (日本語, 中文, العربية, עברית)
  • German Umlauts - Native support for äöüÄÖÜß
  • Special Characters - Handles pipes, colons, quotes in filenames
  • Null-Byte Handling - Automatically cleans corrupted files
  • Encoding Info - Reports detected encoding in analysis results
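
One common way to switch between UTF-8 and Latin-1 is to attempt a strict UTF-8 decode and fall back when it fails. The sketch below shows that approach together with null-byte stripping. It is an assumption about the technique, not the server's actual detection logic, and `decodeText` is a hypothetical name.

```typescript
interface DecodedText {
  text: string;
  encoding: "utf-8" | "latin1";
}

// Tries strict UTF-8 first; invalid byte sequences trigger a Latin-1 fallback.
function decodeText(bytes: Uint8Array): DecodedText {
  let text: string;
  let encoding: DecodedText["encoding"];
  try {
    // With fatal: true, decode() throws on malformed UTF-8 instead of
    // silently inserting replacement characters.
    text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    encoding = "utf-8";
  } catch {
    text = Buffer.from(bytes).toString("latin1");
    encoding = "latin1";
  }
  // Remove null bytes that sometimes appear in damaged files.
  return { text: text.replace(/\0/g, ""), encoding };
}
```

Because every byte sequence is valid Latin-1, the fallback never fails; it can only mislabel text that uses some other single-byte encoding.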

Quality Assurance

This project maintains high quality standards with comprehensive testing:

  • 100 Automated Tests covering all functionality
  • 99% Test Pass Rate (99/100 tests passing)
  • ISO 25010 Compliant - Quality characteristics validated
  • Performance Benchmarks - <100ms per file average
  • Security Tested - Path traversal and data privacy verified

See TEST-DOCUMENTATION.md for detailed test coverage and results.


Roadmap

  • [x] Multi-format document support (PDF, DOCX, Pages, Images, TXT)
  • [x] Batch processing support
  • [x] Auto-filing to folders based on content
  • [x] Recursive folder scanning
  • [x] Duplicate detection with SHA256
  • [x] Preview mode (dry-run)
  • [x] Backup & Undo functionality
  • [x] Metadata export (JSON/CSV)
  • [x] Copy vs Move modes
  • [x] OCR quality feedback
  • [ ] Configurable naming templates
  • [ ] Custom reference number patterns
  • [ ] Excel/CSV document support
  • [ ] Integration with document management systems
  • [ ] Machine learning for improved categorization

License

MIT License - see LICENSE file


Made for intelligent document workflows
