paper-extraction-MCP

paper-extraction-MCP

Enables structured information extraction from academic PDFs using LLMs, integrating with Claude Desktop for natural language querying and batch processing.

Category
Visit Server

README

Paper Extraction MCP

License: MIT Python 3.8+ MCP

An MCP (Model Context Protocol) server for structured information extraction from academic PDF papers using LLM. It integrates seamlessly with Claude Desktop, allowing you to extract metadata and domain-specific content categories from research papers through natural language conversation.

Typhoon disaster governance is provided as a built-in example. The system is fully customizable for any research domain — see Adapting to Other Domains.

Features

  • LLM-Powered Extraction — Uses OpenAI-compatible LLMs to extract structured data from full-text PDFs
  • Customizable Schema — Define your own metadata fields and content categories via config.json
  • Smart Chunking — Automatically splits long papers and merges results with deduplication
  • Dual Output — JSON and CSV formats for downstream analysis
  • MCP Protocol — Works directly inside Claude Desktop as a tool server
  • Batch Processing — Extract from a single paper or all papers at once

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure API Key

Copy the template and fill in your API key:

cp config.json.template config.json

Edit config.json:

{
  "llm_config": {
    "api_key": "sk-your-api-key-here",
    "api_base": "https://api.openai.com/v1",
    "model": "gpt-4o"
  }
}

Any OpenAI-compatible API is supported (OpenAI, Azure OpenAI, local LLMs with OpenAI-compatible endpoints, third-party proxies, etc.).

3. Add PDF Papers

Place your PDF files in the papers/ directory.

4. Configure Claude Desktop

Edit the Claude Desktop config file:

  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

Add:

{
  "mcpServers": {
    "paper-extraction": {
      "command": "python",
      "args": ["<full-path-to>/paper-extraction-MCP/server.py"],
      "cwd": "<full-path-to>/paper-extraction-MCP"
    }
  }
}

Replace <full-path-to> with your actual path. Then restart Claude Desktop.

5. Use It

In Claude Desktop, simply say:

List the PDF papers available for extraction.
Extract the paper "my_paper.pdf".
Extract all papers and show me a summary.

MCP Tools

Tool Description
list_papers List all PDF files in the papers/ directory
extract_paper Extract metadata + categories from a single PDF
extract_all_papers Batch extract all PDFs
get_extraction_result Retrieve a previously extracted result (JSON)

Project Structure

paper-extraction-MCP/
├── server.py              # MCP server (entry point)
├── pdf_extractor.py       # Core extraction logic
├── config.json            # Your configuration (gitignored)
├── config.json.template   # Configuration template
├── requirements.txt       # Python dependencies
├── papers/                # Place PDF files here
├── outputs/
│   ├── json/              # JSON extraction results
│   └── csv/               # CSV extraction results
├── setup.bat              # Windows quick setup
├── setup.sh               # macOS/Linux quick setup
├── CLAUDE_SETUP.md        # Detailed Claude Desktop setup guide
├── LICENSE                # MIT License
└── README.md

How It Works

PDF file
  │
  ▼
pdfplumber (text extraction)
  │
  ▼
Full text ──► LLM API call ──► Structured JSON
                  │
          config.json defines:
          - extraction_prompt (fields & rules)
          - llm_config (model, temperature)
                  │
                  ▼
          JSON + CSV output
  1. Text Extraction: pdfplumber extracts full text from each PDF page
  2. LLM Extraction: The text is sent to an LLM with your extraction_prompt, which defines what fields to extract and how
  3. Smart Chunking: If the text exceeds ~100K characters, it is automatically split into chunks, extracted separately, and merged with deduplication
  4. Output: Results are saved as JSON and CSV

Built-in Example: Typhoon Disaster Governance

The default config.json is pre-configured for extracting information from typhoon disaster governance papers:

Metadata fields (5):

  • DOI, Title, Journal, Author Affiliations, Publication Date

Content categories (7):

Category Description
Detection & Early Warning Monitoring, forecasting, alert systems
Engineering Protection Seawalls, drainage, building reinforcement
Emergency Response Evacuation, shelters, rescue operations
Post-disaster Recovery Reconstruction, ecological restoration
Policy & Management Regulations, institutional coordination
Digital Technology AI, big data, remote sensing, GIS, IoT
Other Measures Community-based, education, insurance

Adapting to Other Domains

The core of this tool is domain-agnostic. You only need to modify config.json — no code changes required. Here is a step-by-step guide:

Step 1: Define Your Categories

Decide what information you want to extract. For example:

Domain Possible Categories
Climate Change Adaptation Mitigation measures, Adaptation strategies, Carbon reduction technologies, Policy instruments, Financial mechanisms
Urban Planning Land use strategies, Transportation planning, Green infrastructure, Zoning regulations, Community engagement
Public Health Prevention measures, Treatment protocols, Surveillance systems, Policy interventions, Technology applications
Cybersecurity Threat detection, Prevention measures, Incident response, Recovery procedures, Governance frameworks
Supply Chain Risk identification, Mitigation strategies, Resilience measures, Technology solutions, Regulatory compliance

Step 2: Write Your Extraction Prompt

Edit the extraction_prompt field in config.json. The prompt should:

  1. Describe the assistant's role for your domain
  2. List metadata fields (DOI, title, journal, etc. — usually the same across domains)
  3. Define each content category with clear descriptions and examples
  4. Set extraction rules (no hallucination, preserve original text, deduplication)
  5. Specify the output JSON format with exact key names

Here is a template you can adapt:

{
  "extraction_prompt": [
    "You are an academic information extraction assistant specialized in [YOUR DOMAIN].",
    "",
    "From each paper, extract:",
    "1) Bibliographic metadata",
    "2) Domain-specific content, categorized as follows:",
    "",
    "METADATA FIELDS:",
    "- 论文DOI: Full DOI URL",
    "- 题目: Paper title",
    "- 期刊名称: Journal name",
    "- 作者机构: Author affiliations (semicolon-separated)",
    "- 发表日期: Publication date (Month Year)",
    "",
    "CONTENT CATEGORIES:",
    "- [Category1_Key]: [Description of what to extract]",
    "- [Category2_Key]: [Description of what to extract]",
    "- ... (add as many as needed)",
    "",
    "RULES:",
    "- Only extract content explicitly present in the paper",
    "- Preserve original text, do not summarize",
    "- Output as JSON with metadata as strings, categories as arrays of strings"
  ]
}

Step 3: Update the Field Mapping in pdf_extractor.py

If you change the Chinese key names in your extraction prompt (e.g., use "预防措施" instead of "检测预警措施"), update the _format_result() and _merge_chunk_results() methods in pdf_extractor.py to map your new keys to the internal field names.

For example, if your domain is public health:

# In _format_result():
final_result = {
    # ... metadata fields stay the same ...
    "prevention_measures": self._join_measures(result_data.get("预防措施", [])),
    "treatment_measures": self._join_measures(result_data.get("治疗措施", [])),
    "surveillance_measures": self._join_measures(result_data.get("监测措施", [])),
    # ... add your categories ...
}

Step 4: Update server.py Display (Optional)

If you want the MCP tool output to show your custom category names, update the call_tool() function in server.py where it formats the extraction result display.

Tips for Writing Good Extraction Prompts

  1. Be specific: Provide concrete examples of what belongs in each category
  2. Set boundaries: Clearly state what does NOT belong in each category to avoid overlap
  3. Request detail: Ask for full paragraphs, not just keywords — this prevents information loss
  4. Use the paper's language: Tell the LLM to preserve the original language (Chinese/English)
  5. Test iteratively: Try your prompt on 2-3 papers, review the results, and refine

Configuration Reference

config.json

Field Type Description
papers_dir string Directory containing PDF files (default: "papers")
output_dir string Output directory (default: "outputs")
extraction_prompt string or string[] The LLM prompt defining extraction fields and rules
llm_config.enabled bool Enable/disable LLM extraction
llm_config.provider string LLM provider (currently "openai")
llm_config.model string Model name (e.g., "gpt-4o", "gpt-4-turbo")
llm_config.api_key string Your API key
llm_config.api_base string API base URL
llm_config.temperature number Generation temperature (0 = deterministic)

Recommended Models

Model Speed Quality Cost
gpt-4o Fast High Medium
gpt-4-turbo Medium Highest High
gpt-3.5-turbo Fastest Good Low

Any OpenAI-compatible model works (DeepSeek, Qwen, local Ollama, etc.).

Cost Estimate

Using GPT-4o:

  • Single paper (~10 pages): ~$0.04-0.07
  • 100 papers: ~$4-7

Troubleshooting

Problem Solution
MCP server not visible in Claude Check config path, restart Claude Desktop
API call fails Verify API key, check network, check account balance
Empty extraction Ensure PDF is text-based (not scanned images)
Incomplete results Paper may be too long — chunking handles this automatically

See CLAUDE_SETUP.md for a detailed setup and troubleshooting guide.

License

MIT License


If this project helps your research, please give it a star!

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured