doc-ingestor-mcp

An MCP server that uses Docling to convert PDFs, Office documents, images, audio, and more into clean Markdown for AI processing and RAG pipelines.

An MCP (Model Context Protocol) server that provides intelligent document ingestion capabilities using the Docling toolkit. Convert any document (PDF, DOCX, images, HTML, etc.) into clean Markdown for AI processing and RAG pipelines.

Features

  • Universal File Support: PDFs, DOCX/XLSX/PPTX, images (PNG/JPEG/TIFF/BMP/WEBP), HTML, Markdown, CSV, audio files, and more
  • Flexible Input: Process local files or remote URLs
  • Multiple Processing Pipelines: Standard (fast, high-quality), VLM (vision-language models), ASR (audio transcription)
  • Intelligent Auto-Detection: Automatically selects optimal settings based on file type and content
  • Queue Management: Handles concurrent requests with proper job queuing
  • Mac M2 Optimized: Efficient memory usage and MLX acceleration support
  • Clean Markdown Output: High-quality structured text ready for AI consumption

Installation

Prerequisites

  • Python 3.9+ (recommended: 3.11+)
  • macOS (optimized for Apple Silicon M2)
  • 8GB+ RAM recommended

Setup

  1. Clone and install dependencies:
git clone <repository-url>
cd doc-ingestor-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  2. Install Docling with Mac optimizations:
# Core Docling
pip install docling
# For MLX support (Apple Silicon only); the quotes stop zsh from treating the brackets as a glob:
pip install "docling[mlx]"
# Optional: additional OCR engines
pip install easyocr
# Install Tesseract via Homebrew: brew install tesseract
  3. Start the MCP server:
python -m doc_ingestor_mcp

The server will start and listen for MCP connections using stdio transport.

MCP Tools

The server provides the following MCP tools:

convert_document

Converts any supported document to Markdown.

Parameters:

  • source (required): File path or URL to the document
  • pipeline (optional): Processing pipeline - "standard", "vlm", or "asr"
  • options (optional): Additional processing options

Example:

{
  "name": "convert_document",
  "arguments": {
    "source": "https://arxiv.org/pdf/2408.09869",
    "pipeline": "standard"
  }
}

Response:

{
  "content": [
    {
      "type": "text",
      "text": "# Document Title\n\nConverted markdown content here..."
    }
  ]
}
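
The request and response shapes above can be exercised from Python without any MCP library. The following is a minimal sketch rather than the server's own client code: `build_tool_call` and `extract_markdown` are hypothetical helper names, and the JSON-RPC envelope mirrors the `tools/call` request used in the testing section of this README.

```python
import json
from itertools import count

# JSON-RPC ids only need to be unique per connection.
_request_ids = count(1)


def build_tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as one line of JSON-RPC 2.0."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


def extract_markdown(result: dict) -> str:
    """Join every text item in a tool result shaped like the response above."""
    return "\n".join(
        item["text"]
        for item in result.get("content", [])
        if item.get("type") == "text"
    )


request_line = build_tool_call(
    "convert_document",
    {"source": "https://arxiv.org/pdf/2408.09869", "pipeline": "standard"},
)
sample_result = {
    "content": [
        {"type": "text", "text": "# Document Title\n\nConverted markdown content here..."}
    ]
}
print(extract_markdown(sample_result).splitlines()[0])  # prints "# Document Title"
```

Joining all text items, instead of reading only the first, keeps the helper usable if a result ever carries more than one content entry.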

convert_document_advanced

Advanced conversion with detailed configuration options.

Parameters:

  • source (required): File path or URL
  • pipeline (optional): "standard", "vlm", "asr"
  • ocr_enabled (optional): Enable/disable OCR (default: auto-detect)
  • ocr_language (optional): OCR language codes (e.g., "eng,spa")
  • table_mode (optional): "fast" or "accurate"
  • pdf_backend (optional): "dlparse_v4" or "pypdfium2"
  • enable_enrichments (optional): Enable code/formula/picture enrichments

Example:

{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-document.pdf",
    "pipeline": "standard",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
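
Because several of these parameters accept only a fixed set of values, it can help to check arguments before sending them. This is a client-side sketch under stated assumptions: `validate_advanced_arguments` is a hypothetical helper, the allowed values are copied from the parameter list above, and treating `enable_enrichments` as a boolean is a guess based on its description.

```python
# Allowed values copied from the parameter list above.
ALLOWED_VALUES = {
    "pipeline": {"standard", "vlm", "asr"},
    "table_mode": {"fast", "accurate"},
    "pdf_backend": {"dlparse_v4", "pypdfium2"},
}
# Assumption: both flags are plain booleans.
BOOLEAN_KEYS = ("ocr_enabled", "enable_enrichments")


def validate_advanced_arguments(arguments: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not arguments.get("source"):
        problems.append("source is required")
    for key, allowed in ALLOWED_VALUES.items():
        if key in arguments and arguments[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    for key in BOOLEAN_KEYS:
        if key in arguments and not isinstance(arguments[key], bool):
            problems.append(f"{key} must be true or false")
    return problems


print(validate_advanced_arguments({"source": "./scanned-document.pdf", "table_mode": "slow"}))
```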

get_processing_status

Check the status of ongoing conversions (useful for large files).

Parameters:

  • job_id (required): Job identifier returned from conversion requests

list_supported_formats

Returns all supported input and output formats.

Response:

{
  "input_formats": ["pdf", "docx", "xlsx", "pptx", "png", "jpeg", "html", "md", "csv", "mp3", "wav"],
  "output_formats": ["markdown", "html", "json", "text", "doctags"],
  "pipelines": ["standard", "vlm", "asr"]
}
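
A client could use this response to reject unsupported files before submitting them. The sketch below assumes the format list shown above; `is_supported` and the `EXTENSION_ALIASES` mapping are hypothetical and not part of the server.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Copied from the sample response above; a live server may report a different list.
INPUT_FORMATS = ["pdf", "docx", "xlsx", "pptx", "png", "jpeg", "html", "md", "csv", "mp3", "wav"]

# Assumption: common alternate extensions map onto the reported format names.
EXTENSION_ALIASES = {"jpg": "jpeg", "htm": "html"}


def is_supported(source: str, input_formats=INPUT_FORMATS) -> bool:
    """Guess support from the file extension of a path or URL."""
    # For URLs, look only at the path so query strings do not hide the extension.
    path = urlparse(source).path if "://" in source else source
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return EXTENSION_ALIASES.get(suffix, suffix) in input_formats


print(is_supported("./research-paper.pdf"))
print(is_supported("https://arxiv.org/pdf/2408.09869"))
```

The second call returns False because the arXiv URL has no recognizable file extension, which is why extension checks alone cannot replace the content analysis the server performs.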

Usage Examples

Basic PDF Conversion

{
  "name": "convert_document",
  "arguments": {
    "source": "./research-paper.pdf"
  }
}

URL-based Conversion with VLM Pipeline

{
  "name": "convert_document",
  "arguments": {
    "source": "https://example.com/complex-document.pdf",
    "pipeline": "vlm"
  }
}

Audio Transcription

{
  "name": "convert_document",
  "arguments": {
    "source": "./meeting-recording.mp3",
    "pipeline": "asr"
  }
}

Scanned Document with OCR

{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-invoice.pdf",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}

Pipeline Selection Guide

Standard Pipeline (Default)

  • Best for: Born-digital PDFs, Office documents, clean layouts
  • Features: Advanced layout analysis, table structure recovery, optional OCR
  • Performance: Fast, memory-efficient
  • Use when: Document has programmatic text and standard layouts

VLM Pipeline

  • Best for: Complex layouts, handwritten notes, screenshots, scanned documents
  • Features: Vision-language model processing, end-to-end page understanding
  • Performance: Slower, higher memory usage, MLX-accelerated on M2
  • Use when: Standard pipeline fails or document has unusual layouts

ASR Pipeline

  • Best for: Audio files (meetings, lectures, interviews)
  • Features: Whisper-based transcription, multiple model sizes
  • Performance: CPU/GPU intensive depending on model size
  • Use when: Processing audio content

Auto-Detection Logic

The server automatically selects optimal settings:

  1. File Type Detection: Based on extension and content analysis
  2. OCR Decision: Enabled for scanned PDFs and images, disabled for text-based documents
  3. Pipeline Selection: Standard for most documents, VLM suggested for images and complex layouts
  4. Backend Selection: Native parser (dlparse_v4) for quality, pypdfium2 for speed/compatibility
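
The four rules above can be sketched as a single function. This is an illustration of the described behavior, not the server's implementation: `choose_settings`, the `suggest` key, and the `has_text_layer` flag (a stand-in for the content analysis mentioned in step 1) are all assumptions.

```python
from pathlib import PurePosixPath

AUDIO_EXTENSIONS = {".mp3", ".wav"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}


def choose_settings(source: str, has_text_layer: bool = True) -> dict:
    """Pick a pipeline and OCR setting from the file extension.

    has_text_layer is a hypothetical result of content analysis:
    True for born-digital PDFs, False for scans.
    """
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return {"pipeline": "asr", "ocr_enabled": False}
    if suffix in IMAGE_EXTENSIONS:
        # Images get OCR on the standard pipeline, with VLM as a suggestion.
        return {"pipeline": "standard", "ocr_enabled": True, "suggest": "vlm"}
    if suffix == ".pdf":
        return {
            "pipeline": "standard",
            "ocr_enabled": not has_text_layer,
            "pdf_backend": "dlparse_v4",
        }
    # Office documents, HTML, Markdown, CSV: text is already available.
    return {"pipeline": "standard", "ocr_enabled": False}


print(choose_settings("./scanned-invoice.pdf", has_text_layer=False))
```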

Performance Optimization (Mac M2)

Memory Management

  • Large Files: Automatic chunking and streaming processing
  • Queue System: Prevents memory overflow from concurrent requests
  • Cleanup: Automatic temporary file cleanup after processing

MLX Acceleration

  • VLM models run with MLX optimization on Apple Silicon
  • Reduced memory footprint compared to standard PyTorch
  • Automatic fallback to CPU if MLX unavailable

Configuration

# Environment variables for optimization
export DOCLING_MAX_MEMORY_GB=6        # Limit memory usage
export DOCLING_QUEUE_SIZE=3           # Max concurrent jobs
export DOCLING_ENABLE_MLX=true        # Enable MLX acceleration
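
For reference, reading these variables from Python might look like the sketch below. The `RuntimeLimits` dataclass and `load_limits` function are hypothetical, and the fallback values simply reuse the numbers from the example above; the server's real defaults are not documented here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeLimits:
    max_memory_gb: float
    queue_size: int
    enable_mlx: bool


def load_limits(env=os.environ) -> RuntimeLimits:
    """Parse the three DOCLING_* variables, falling back to assumed defaults."""
    return RuntimeLimits(
        max_memory_gb=float(env.get("DOCLING_MAX_MEMORY_GB", "6")),
        queue_size=int(env.get("DOCLING_QUEUE_SIZE", "3")),
        # Accept a few common spellings of "on".
        enable_mlx=env.get("DOCLING_ENABLE_MLX", "true").strip().lower() in {"1", "true", "yes"},
    )


print(load_limits({"DOCLING_QUEUE_SIZE": "5"}))
```

Passing the environment as a parameter keeps the parser easy to test without touching the real process environment.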

Error Handling

Automatic Retry Logic

  • Network timeouts for URL-based files
  • Fallback pipelines if primary fails
  • Alternative OCR engines if primary fails
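
The retry and fallback behavior listed above can be approximated on the client side as well. This is a sketch under assumptions: `convert_with_fallback` is hypothetical, the `convert` callable stands in for whatever sends the request, the standard-then-VLM order is a guess, and mapping `TimeoutError` to "retry" and `RuntimeError` to "try the next pipeline" is an illustrative choice rather than the server's error model.

```python
import time


def convert_with_fallback(convert, source, pipelines=("standard", "vlm"), attempts=2, delay=0.0):
    """Try each pipeline in order, retrying timeouts with exponential backoff."""
    last_error = None
    for pipeline in pipelines:
        for attempt in range(attempts):
            try:
                return convert(source, pipeline)
            except TimeoutError as exc:
                # Transient failure: wait, then retry the same pipeline.
                last_error = exc
                time.sleep(delay * (2 ** attempt))
            except RuntimeError as exc:
                # Conversion failure: give up on this pipeline and move on.
                last_error = exc
                break
    raise RuntimeError("all pipelines failed") from last_error


def demo_convert(source, pipeline):
    # Stand-in converter that only succeeds on the VLM pipeline.
    if pipeline == "standard":
        raise RuntimeError("layout analysis failed")
    return f"# converted {source} with {pipeline}"


print(convert_with_fallback(demo_convert, "scan.pdf"))  # prints "# converted scan.pdf with vlm"
```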

Error Response Format

{
  "error": {
    "type": "ConversionError",
    "message": "Failed to process document",
    "details": "Specific error information",
    "suggestions": ["Try VLM pipeline", "Enable OCR"]
  }
}
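
A caller can turn this payload into a Python exception so that the suggestions travel with the failure. The `ConversionFailed` class and `raise_for_error` helper below are hypothetical names used for illustration; only the field names come from the format above.

```python
class ConversionFailed(Exception):
    """Carries the fields of the error payload shown above."""

    def __init__(self, error: dict):
        self.error_type = error.get("type", "UnknownError")
        self.details = error.get("details", "")
        self.suggestions = list(error.get("suggestions", []))
        super().__init__(f"{self.error_type}: {error.get('message', 'no message')}")


def raise_for_error(payload: dict) -> dict:
    """Return the payload unchanged, or raise if it contains an error object."""
    if "error" in payload:
        raise ConversionFailed(payload["error"])
    return payload


try:
    raise_for_error({
        "error": {
            "type": "ConversionError",
            "message": "Failed to process document",
            "details": "Specific error information",
            "suggestions": ["Try VLM pipeline", "Enable OCR"],
        }
    })
except ConversionFailed as exc:
    print(exc, "| suggestions:", ", ".join(exc.suggestions))
```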

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| Memory error with large PDF | Insufficient RAM | Split document or reduce queue size |
| Poor OCR quality | Wrong language/engine | Specify language with ocr_language |
| Scrambled text order | PDF parsing issues | Try "pdf_backend": "pypdfium2" |
| Tables not detected | Layout complexity | Use "table_mode": "accurate" |
| Slow processing | Large/complex document | Try "pipeline": "standard" first |

Integration Examples

Claude Desktop MCP Configuration

Add this to your Claude Desktop configuration file (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "doc-ingestor": {
      "command": "python",
      "args": ["-m", "doc_ingestor_mcp"],
      "cwd": "/path/to/doc-ingestor-mcp"
    }
  }
}
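
If the configuration file already lists other servers, merging the entry programmatically avoids overwriting them. The sketch below is one possible approach: `with_doc_ingestor` is a hypothetical helper and `other-server` is a made-up existing entry used only for the demonstration.

```python
import copy
import json


def with_doc_ingestor(config: dict, project_dir: str) -> dict:
    """Return a copy of a Claude Desktop config with the doc-ingestor entry added."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    merged.setdefault("mcpServers", {})["doc-ingestor"] = {
        "command": "python",
        "args": ["-m", "doc_ingestor_mcp"],
        "cwd": project_dir,
    }
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(with_doc_ingestor(existing, "/path/to/doc-ingestor-mcp"), indent=2))
```

To apply it for real, load the JSON from the configuration path given above, pass it through the helper, and write the result back to the same file.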

Testing the Installation

  1. Test basic functionality:
# Start the server in debug mode
python -m doc_ingestor_mcp --debug

# Alternatively, pipe a single request into a fresh server process over stdio.
# MCP clients normally send an initialize handshake before tools/call,
# so a strict server may reject this bare request.
echo '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "convert_document", "arguments": {"source": "test.pdf"}}}' | python -m doc_ingestor_mcp
  2. Test with Claude Desktop:

    • Restart Claude Desktop after adding the MCP configuration
    • Confirm that the server appears in Claude's available tools
    • In a new conversation, attach a PDF file and try: "Can you convert this PDF to markdown?"
  3. Test different file types:

# Test with different pipelines
python test_server.py

Create test_server.py with the following contents before running the command above:

import asyncio

from doc_ingestor_mcp.server import DocIngestorMCPServer
from doc_ingestor_mcp.config import load_config


async def test_conversion():
    # Build the server from the project's config file
    config = load_config("config.yaml")
    server = DocIngestorMCPServer(config)

    # Call the tool handler directly, bypassing the MCP transport
    result = await server._handle_convert_document({
        "source": "https://arxiv.org/pdf/2408.09869",
        "pipeline": "standard"
    })

    # The handler returns a list of content items; the first holds the Markdown
    print("Conversion successful!")
    print(f"Output length: {len(result[0].text)} characters")


if __name__ == "__main__":
    asyncio.run(test_conversion())

File Size Limits

  • PDFs: Up to 500MB (auto-chunked)
  • Images: Up to 50MB per image
  • Audio: Up to 2GB (processed in segments)
  • Office Docs: Up to 200MB
  • URLs: 10-minute timeout for downloads
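
These limits can be checked locally before a file is submitted. The sketch below encodes the figures above; `exceeds_limit` and the extension-to-category mapping are hypothetical, and interpreting MB as 2**20 bytes (and 2GB as 2048MB) is an assumption, since the list does not say which convention applies.

```python
from pathlib import PurePosixPath

MB = 1024 * 1024  # assumption: binary megabytes

# Limits copied from the list above.
SIZE_LIMITS = {
    "pdf": 500 * MB,
    "image": 50 * MB,
    "audio": 2048 * MB,
    "office": 200 * MB,
}

# Hypothetical mapping from extension to limit category.
CATEGORY_BY_EXTENSION = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".bmp": "image", ".webp": "image",
    ".mp3": "audio", ".wav": "audio",
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
}


def exceeds_limit(filename: str, size_bytes: int) -> bool:
    """Return True when the file is larger than the limit for its category."""
    category = CATEGORY_BY_EXTENSION.get(PurePosixPath(filename).suffix.lower())
    if category is None:
        return False  # no documented limit for this file type
    return size_bytes > SIZE_LIMITS[category]


print(exceeds_limit("scan.png", 51 * MB))  # prints True
```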

Security Considerations

  • Local Processing: All processing happens locally by default
  • Remote Services: Optional (disabled by default) for VLM APIs
  • File Cleanup: Temporary files automatically deleted
  • URL Validation: Safe URL patterns enforced
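
The README does not spell out which URL patterns count as safe. One plausible reading, sketched below, is to accept only http and https URLs and to reject hosts that point at the local machine or a private network. `is_safe_url` is a hypothetical function and these rules are an assumption, not the server's documented policy.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Accept http(s) URLs whose host is not local or private."""
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return False
    try:
        address = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # A DNS name rather than an IP literal. Resolving the name and
        # re-checking the resulting address is out of scope for this sketch.
        return True
    # Rejects loopback, private, and link-local addresses.
    return address.is_global


print(is_safe_url("https://arxiv.org/pdf/2408.09869"))  # prints True
print(is_safe_url("http://127.0.0.1:8000/secret.pdf"))  # prints False
```

A production check would also need to validate the address a hostname resolves to, since a public-looking name can point at an internal host.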

Troubleshooting

Debug Mode

python -m doc_ingestor_mcp --debug

Log Analysis

tail -f ./logs/server.log

Run Test Suite

python test_server.py

Common Issues

"ModuleNotFoundError: No module named 'docling'"

pip install docling

"MLX not available" warnings

  • This is normal on Macs without Apple Silicon
  • MLX acceleration is optional; the server will fall back to CPU

"Queue is full" errors

  • Wait for current jobs to complete
  • Increase max_queue_size in config.yaml

"Download failed" for URLs

  • Check internet connection
  • Verify URL is accessible
  • Some sites may block automated downloads

Memory errors with large files

  • Reduce max_memory_gb in config.yaml
  • Try smaller files first
  • Use pipeline: "standard" instead of vlm

OCR not working

  • Install tesseract: brew install tesseract
  • Install easyocr: pip install easyocr
  • Check language settings in config.yaml

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE file for details.
