doc-ingestor-mcp
An MCP server that uses Docling to convert PDFs, Office documents, images, audio, and more into clean Markdown for AI processing and RAG pipelines.
An MCP (Model Context Protocol) server for document ingestion, built on the Docling toolkit. It converts documents (PDF, DOCX, images, HTML, and more) into clean Markdown for AI processing and RAG pipelines.
Features
- Universal File Support: PDFs, DOCX/XLSX/PPTX, images (PNG/JPEG/TIFF/BMP/WEBP), HTML, Markdown, CSV, audio files, and more
- Flexible Input: Process local files or remote URLs
- Multiple Processing Pipelines: Standard (fast, high-quality), VLM (vision-language models), ASR (audio transcription)
- Intelligent Auto-Detection: Automatically selects optimal settings based on file type and content
- Queue Management: Handles concurrent requests with proper job queuing
- Mac M2 Optimized: Efficient memory usage and MLX acceleration support
- Clean Markdown Output: High-quality structured text ready for AI consumption
Installation
Prerequisites
- Python 3.9+ (recommended: 3.11+)
- macOS (optimized for Apple Silicon M2)
- 8GB+ RAM recommended
Setup
- Clone and install dependencies:
git clone <repository-url>
cd doc-ingestor-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Install Docling with Mac optimizations:
# Core Docling
pip install docling
# For MLX support (Apple Silicon only); the quotes keep zsh from expanding the brackets:
pip install "docling[mlx]"
# Optional: additional OCR engines
pip install easyocr
# Install tesseract via homebrew: brew install tesseract
- Start the MCP server:
python -m doc_ingestor_mcp
The server will start and listen for MCP connections using stdio transport.
MCP Tools
The server provides the following MCP tools:
convert_document
Converts any supported document to Markdown.
Parameters:
- source (required): File path or URL to the document
- pipeline (optional): Processing pipeline, one of "standard", "vlm", or "asr"
- options (optional): Additional processing options
Example:
{
  "name": "convert_document",
  "arguments": {
    "source": "https://arxiv.org/pdf/2408.09869",
    "pipeline": "standard"
  }
}
Response:
{
  "content": [
    {
      "type": "text",
      "text": "# Document Title\n\nConverted markdown content here..."
    }
  ]
}
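When driving the server by hand over stdio, it can help to see the whole message sequence rather than a single call. MCP clients normally open a session with an `initialize` request and a `notifications/initialized` notification before any `tools/call`. The following is a minimal sketch under those assumptions; the helper names, the client name, and the protocol version string are illustrative and not part of this project.

```python
import json


def build_session_messages(source, pipeline="standard"):
    """Return newline-ready JSON-RPC messages for one convert_document call."""
    messages = [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                # Assumed version string; use the one your client supports.
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": "example-client", "version": "0.1.0"},
            },
        },
        # Notifications carry no "id" and expect no reply.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {
            "jsonrpc": "2.0",
            "id": 2,
            "method": "tools/call",
            "params": {
                "name": "convert_document",
                "arguments": {"source": source, "pipeline": pipeline},
            },
        },
    ]
    return [json.dumps(message) for message in messages]


def extract_markdown(tool_result):
    """Join the text items of a result shaped like the Response shown above."""
    return "\n".join(
        item["text"]
        for item in tool_result.get("content", [])
        if item.get("type") == "text"
    )
```

Each returned string would be written to the server's stdin on its own line. `extract_markdown` expects the tool result object itself, not the surrounding JSON-RPC envelope.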
convert_document_advanced
Advanced conversion with detailed configuration options.
Parameters:
- source (required): File path or URL
- pipeline (optional): "standard", "vlm", or "asr"
- ocr_enabled (optional): Enable/disable OCR (default: auto-detect)
- ocr_language (optional): OCR language codes (e.g., "eng,spa")
- table_mode (optional): "fast" or "accurate"
- pdf_backend (optional): "dlparse_v4" or "pypdfium2"
- enable_enrichments (optional): Enable code/formula/picture enrichments
Example:
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-document.pdf",
    "pipeline": "standard",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
get_processing_status
Check the status of ongoing conversions (useful for large files).
Parameters:
- job_id (required): Job identifier returned from conversion requests
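This README does not document the status payload, so the polling loop below is only a sketch. It assumes the caller supplies a `get_status` function that wraps the `get_processing_status` tool call and returns a dict with a `state` key. The key name and the values "queued" and "processing" are hypothetical.

```python
import time


def wait_for_job(get_status, job_id, interval_s=2.0, timeout_s=600.0):
    """Poll get_status(job_id) until the job leaves an in-progress state."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        # "state" and its in-progress values are assumed names.
        if status.get("state") not in ("queued", "processing"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
        time.sleep(interval_s)
```

Passing the status lookup as a function keeps the loop independent of whichever MCP client library issues the call.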
list_supported_formats
Returns all supported input and output formats.
Response:
{
  "input_formats": ["pdf", "docx", "xlsx", "pptx", "png", "jpeg", "html", "md", "csv", "mp3", "wav"],
  "output_formats": ["markdown", "html", "json", "text", "doctags"],
  "pipelines": ["standard", "vlm", "asr"]
}
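A client could use this response to check a file before submitting it. The sketch below compares the source's extension with `input_formats`; the `is_supported` helper and the alias table are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions that map onto a differently named format entry (assumed aliases).
ALIASES = {"jpg": "jpeg", "htm": "html", "markdown": "md"}


def is_supported(source, formats_response):
    """Return True if the source's extension is listed in input_formats."""
    # For URLs, look only at the path so query strings do not interfere.
    path = urlparse(source).path if "://" in source else source
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    ext = ALIASES.get(ext, ext)
    return ext in formats_response["input_formats"]
```

URLs without a file extension, such as the arXiv link used earlier, cannot be classified this way and would need a content-type check instead.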
Usage Examples
Basic PDF Conversion
{
  "name": "convert_document",
  "arguments": {
    "source": "./research-paper.pdf"
  }
}
URL-based Conversion with VLM Pipeline
{
  "name": "convert_document",
  "arguments": {
    "source": "https://example.com/complex-document.pdf",
    "pipeline": "vlm"
  }
}
Audio Transcription
{
  "name": "convert_document",
  "arguments": {
    "source": "./meeting-recording.mp3",
    "pipeline": "asr"
  }
}
Scanned Document with OCR
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-invoice.pdf",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
Pipeline Selection Guide
Standard Pipeline (Default)
- Best for: Born-digital PDFs, Office documents, clean layouts
- Features: Advanced layout analysis, table structure recovery, optional OCR
- Performance: Fast, memory-efficient
- Use when: Document has programmatic text and standard layouts
VLM Pipeline
- Best for: Complex layouts, handwritten notes, screenshots, scanned documents
- Features: Vision-language model processing, end-to-end page understanding
- Performance: Slower, higher memory usage, MLX-accelerated on M2
- Use when: Standard pipeline fails or document has unusual layouts
ASR Pipeline
- Best for: Audio files (meetings, lectures, interviews)
- Features: Whisper-based transcription, multiple model sizes
- Performance: CPU/GPU intensive depending on model size
- Use when: Processing audio content
Auto-Detection Logic
The server automatically selects optimal settings:
- File Type Detection: Based on extension and content analysis
- OCR Decision: Enabled for scanned PDFs and images, disabled for text-based documents
- Pipeline Selection: Standard for most documents, VLM suggested for images and complex layouts
- Backend Selection: Native parser (dlparse_v4) for quality, pypdfium2 for speed/compatibility
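The rules above can be approximated in a few lines. This is a simplified sketch that looks only at the file extension and a caller-supplied `is_scanned` flag. The function name and the extension sets are assumptions, and the server's actual content analysis is not modeled.

```python
from pathlib import PurePosixPath

AUDIO_EXTS = {".mp3", ".wav"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}


def suggest_settings(source, is_scanned=False):
    """Suggest a pipeline and OCR setting from the file extension."""
    ext = PurePosixPath(source).suffix.lower()
    if ext in AUDIO_EXTS:
        return {"pipeline": "asr", "ocr_enabled": False}
    if ext in IMAGE_EXTS:
        # Images: VLM is suggested and OCR is enabled.
        return {"pipeline": "vlm", "ocr_enabled": True}
    if ext == ".pdf":
        # PDFs stay on the standard pipeline; OCR only when scanned.
        return {"pipeline": "standard", "ocr_enabled": is_scanned}
    # Text-based documents: standard pipeline, no OCR.
    return {"pipeline": "standard", "ocr_enabled": False}
```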
Performance Optimization (Mac M2)
Memory Management
- Large Files: Automatic chunking and streaming processing
- Queue System: Prevents memory overflow from concurrent requests
- Cleanup: Automatic temporary file cleanup after processing
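One common way to bound concurrent work in an asyncio server is a semaphore. The sketch below illustrates that general technique and is not a description of this server's actual queue; `run_with_limit` is a hypothetical helper.

```python
import asyncio


async def run_with_limit(jobs, max_concurrent=3):
    """Run coroutine factories with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # Jobs beyond the limit wait here until a slot is released.
        async with semaphore:
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each entry in `jobs` is a zero-argument callable returning a coroutine, so no conversion starts before it has acquired a slot.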
MLX Acceleration
- VLM models run with MLX optimization on Apple Silicon
- Reduced memory footprint compared to standard PyTorch
- Automatic fallback to CPU if MLX unavailable
Configuration
# Environment variables for optimization
export DOCLING_MAX_MEMORY_GB=6 # Limit memory usage
export DOCLING_QUEUE_SIZE=3 # Max concurrent jobs
export DOCLING_ENABLE_MLX=true # Enable MLX acceleration
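For reference, this is one way such variables could be read in Python. The fallback values simply mirror the example above and are assumptions, as is the `load_env_settings` helper; the server's real defaults are not documented here.

```python
import os


def load_env_settings(env=None):
    """Parse the three tuning variables, falling back to assumed defaults."""
    env = os.environ if env is None else env
    mlx_flag = env.get("DOCLING_ENABLE_MLX", "true").strip().lower()
    return {
        "max_memory_gb": float(env.get("DOCLING_MAX_MEMORY_GB", "6")),
        "queue_size": int(env.get("DOCLING_QUEUE_SIZE", "3")),
        "enable_mlx": mlx_flag in ("1", "true", "yes"),
    }
```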
Error Handling
Automatic Retry Logic
- Network timeouts for URL-based files
- Fallback pipelines if primary fails
- Alternative OCR engines if primary fails
Error Response Format
{
  "error": {
    "type": "ConversionError",
    "message": "Failed to process document",
    "details": "Specific error information",
    "suggestions": ["Try VLM pipeline", "Enable OCR"]
  }
}
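A client might turn the `suggestions` list into a second attempt. The sketch below assumes the suggestions are free-text hints like those in the example and matches them by keyword, which is a guess rather than a documented contract. Note that `ocr_enabled` is only accepted by `convert_document_advanced`.

```python
def retry_arguments(arguments, error_response):
    """Return adjusted tool arguments based on suggestions, or None."""
    suggestions = error_response.get("error", {}).get("suggestions", [])
    updated = dict(arguments)
    for suggestion in suggestions:
        text = suggestion.lower()
        # Keyword matching on free text is an assumption of this sketch.
        if "vlm" in text and updated.get("pipeline") != "vlm":
            updated["pipeline"] = "vlm"
        elif "ocr" in text and not updated.get("ocr_enabled"):
            updated["ocr_enabled"] = True
    # None signals that no suggestion changed anything, so retrying is pointless.
    return updated if updated != arguments else None
```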
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| Memory error with large PDF | Insufficient RAM | Split document or reduce queue size |
| Poor OCR quality | Wrong language/engine | Specify language with ocr_language |
| Scrambled text order | PDF parsing issues | Try "pdf_backend": "pypdfium2" |
| Tables not detected | Layout complexity | Use "table_mode": "accurate" |
| Slow processing | Large/complex document | Try "pipeline": "standard" first |
Integration Examples
Claude Desktop MCP Configuration
Add this to your Claude Desktop configuration file (~/Library/Application Support/Claude/claude_desktop_config.json):
{
  "mcpServers": {
    "doc-ingestor": {
      "command": "python",
      "args": ["-m", "doc_ingestor_mcp"],
      "cwd": "/path/to/doc-ingestor-mcp"
    }
  }
}
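If the configuration file already lists other servers, the entry has to be merged in rather than pasted over the file. A minimal sketch, assuming the file is plain JSON; `add_server_entry` is a hypothetical helper and not part of this project.

```python
import json
from pathlib import Path


def add_server_entry(config_path, project_dir):
    """Add or replace the doc-ingestor entry, keeping other servers intact."""
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    config.setdefault("mcpServers", {})["doc-ingestor"] = {
        "command": "python",
        "args": ["-m", "doc_ingestor_mcp"],
        "cwd": str(project_dir),
    }
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

It is worth backing up the existing file first, since the function rewrites it in place.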
Testing the Installation
- Test basic functionality:
# Start the server in debug mode
python -m doc_ingestor_mcp --debug
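# Or pipe a single request into a fresh server process over stdio.
# MCP clients normally send an initialize request first, so a server may reject a bare tools/call.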
echo '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "convert_document", "arguments": {"source": "test.pdf"}}}' | python -m doc_ingestor_mcp
- Test with Claude Desktop:
  - Restart Claude Desktop after adding the MCP configuration
  - In a new conversation, try: "Can you convert this PDF to markdown?" and attach a PDF file
  - The server should appear in Claude's available tools
- Test different file types:
# Test with different pipelines
python test_server.py
Create test_server.py:
import asyncio
import json
from doc_ingestor_mcp.server import DocIngestorMCPServer
from doc_ingestor_mcp.config import load_config

async def test_conversion():
    config = load_config("config.yaml")
    server = DocIngestorMCPServer(config)
    # Test basic conversion
    result = await server._handle_convert_document({
        "source": "https://arxiv.org/pdf/2408.09869",
        "pipeline": "standard"
    })
    print("Conversion successful!")
    print(f"Output length: {len(result[0].text)} characters")

if __name__ == "__main__":
    asyncio.run(test_conversion())
File Size Limits
- PDFs: Up to 500MB (auto-chunked)
- Images: Up to 50MB per image
- Audio: Up to 2GB (processed in segments)
- Office Docs: Up to 200MB
- URLs: 10-minute timeout for downloads
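A client can check these limits before uploading. The sketch below encodes the sizes listed above, treating 1 MB as 1024 * 1024 bytes and 2GB as 2048 MB, which is an assumption. The extension-to-category table and the `within_limit` name are also illustrative.

```python
from pathlib import PurePosixPath

MB = 1024 * 1024

# Limits from the list above, in megabytes.
LIMITS_MB = {"pdf": 500, "image": 50, "audio": 2048, "office": 200}

CATEGORY_BY_EXT = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".bmp": "image", ".webp": "image",
    ".mp3": "audio", ".wav": "audio",
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
}


def within_limit(filename, size_bytes):
    """Return True if the file fits its category limit; unknown types pass."""
    category = CATEGORY_BY_EXT.get(PurePosixPath(filename).suffix.lower())
    if category is None:
        return True
    return size_bytes <= LIMITS_MB[category] * MB
```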
Security Considerations
- Local Processing: All processing happens locally by default
- Remote Services: Optional (disabled by default) for VLM APIs
- File Cleanup: Temporary files automatically deleted
- URL Validation: Safe URL patterns enforced
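The README does not say which URL patterns count as safe. As an illustration of the general idea, the sketch below accepts only http and https URLs and rejects localhost and private, loopback, or link-local IP literals. It is an assumed policy, not the server's actual rule set.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url(url):
    """Accept http(s) URLs that do not point at local or private addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    if host == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: resolving it and re-checking the result is out of scope here.
        return True
    return not (address.is_private or address.is_loopback or address.is_link_local)
```

A stricter check would also resolve hostnames and validate the resulting addresses, because a public-looking name can still resolve to an internal host.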
Troubleshooting
Debug Mode
python -m doc_ingestor_mcp --debug
Log Analysis
tail -f ./logs/server.log
Run Test Suite
python test_server.py
Common Issues
"ModuleNotFoundError: No module named 'docling'"
pip install docling
"MLX not available" warnings
- This is normal on non-Apple Silicon Macs
- MLX acceleration is optional; the server falls back to CPU
"Queue is full" errors
- Wait for current jobs to complete
- Increase max_queue_size in config.yaml
"Download failed" for URLs
- Check internet connection
- Verify URL is accessible
- Some sites may block automated downloads
Memory errors with large files
- Reduce max_memory_gb in config.yaml
- Try smaller files first
- Use pipeline: "standard" instead of vlm
OCR not working
- Install tesseract: brew install tesseract
- Install easyocr: pip install easyocr
- Check language settings in config.yaml
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Docling Project Docs