MCP Servers

paper-extraction-MCP

Enables structured information extraction from academic PDFs using LLMs, integrating with Claude Desktop for natural language querying and batch processing.

README

Paper Extraction MCP

An MCP (Model Context Protocol) server for structured information extraction from academic PDF papers using LLM. It integrates seamlessly with Claude Desktop, allowing you to extract metadata and domain-specific content categories from research papers through natural language conversation.

Typhoon disaster governance is provided as a built-in example. The system is fully customizable for any research domain — see Adapting to Other Domains.

Features

LLM-Powered Extraction — Uses OpenAI-compatible LLMs to extract structured data from full-text PDFs
Customizable Schema — Define your own metadata fields and content categories via config.json
Smart Chunking — Automatically splits long papers and merges results with deduplication
Dual Output — JSON and CSV formats for downstream analysis
MCP Protocol — Works directly inside Claude Desktop as a tool server
Batch Processing — Extract from a single paper or all papers at once

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure API Key

Copy the template and fill in your API key:

cp config.json.template config.json

Edit config.json:

{
  "llm_config": {
    "api_key": "sk-your-api-key-here",
    "api_base": "https://api.openai.com/v1",
    "model": "gpt-4o"
  }
}

Any OpenAI-compatible API is supported (OpenAI, Azure OpenAI, local LLMs with OpenAI-compatible endpoints, third-party proxies, etc.).

3. Add PDF Papers

Place your PDF files in the papers/ directory.

4. Configure Claude Desktop

Edit the Claude Desktop config file:

Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

Add:

{
  "mcpServers": {
    "paper-extraction": {
      "command": "python",
      "args": ["<full-path-to>/paper-extraction-MCP/server.py"],
      "cwd": "<full-path-to>/paper-extraction-MCP"
    }
  }
}

Replace <full-path-to> with your actual path. Then restart Claude Desktop.

5. Use It

In Claude Desktop, simply say:

List the PDF papers available for extraction.

Extract the paper "my_paper.pdf".

Extract all papers and show me a summary.

MCP Tools

Tool	Description
`list_papers`	List all PDF files in the `papers/` directory
`extract_paper`	Extract metadata + categories from a single PDF
`extract_all_papers`	Batch extract all PDFs
`get_extraction_result`	Retrieve a previously extracted result (JSON)

Project Structure

paper-extraction-MCP/
├── server.py              # MCP server (entry point)
├── pdf_extractor.py       # Core extraction logic
├── config.json            # Your configuration (gitignored)
├── config.json.template   # Configuration template
├── requirements.txt       # Python dependencies
├── papers/                # Place PDF files here
├── outputs/
│   ├── json/              # JSON extraction results
│   └── csv/               # CSV extraction results
├── setup.bat              # Windows quick setup
├── setup.sh               # macOS/Linux quick setup
├── CLAUDE_SETUP.md        # Detailed Claude Desktop setup guide
├── LICENSE                # MIT License
└── README.md

How It Works

PDF file
  │
  ▼
pdfplumber (text extraction)
  │
  ▼
Full text ──► LLM API call ──► Structured JSON
                  │
          config.json defines:
          - extraction_prompt (fields & rules)
          - llm_config (model, temperature)
                  │
                  ▼
          JSON + CSV output

Text Extraction: pdfplumber extracts full text from each PDF page
LLM Extraction: The text is sent to an LLM with your extraction_prompt, which defines what fields to extract and how
Smart Chunking: If the text exceeds ~100K characters, it is automatically split into chunks, extracted separately, and merged with deduplication
Output: Results are saved as JSON and CSV

Built-in Example: Typhoon Disaster Governance

The default config.json is pre-configured for extracting information from typhoon disaster governance papers:

Metadata fields (5):

DOI, Title, Journal, Author Affiliations, Publication Date

Content categories (7):

Category	Description
Detection & Early Warning	Monitoring, forecasting, alert systems
Engineering Protection	Seawalls, drainage, building reinforcement
Emergency Response	Evacuation, shelters, rescue operations
Post-disaster Recovery	Reconstruction, ecological restoration
Policy & Management	Regulations, institutional coordination
Digital Technology	AI, big data, remote sensing, GIS, IoT
Other Measures	Community-based, education, insurance

Adapting to Other Domains

The core of this tool is domain-agnostic. You only need to modify config.json — no code changes required. Here is a step-by-step guide:

Step 1: Define Your Categories

Decide what information you want to extract. For example:

Domain	Possible Categories
Climate Change Adaptation	Mitigation measures, Adaptation strategies, Carbon reduction technologies, Policy instruments, Financial mechanisms
Urban Planning	Land use strategies, Transportation planning, Green infrastructure, Zoning regulations, Community engagement
Public Health	Prevention measures, Treatment protocols, Surveillance systems, Policy interventions, Technology applications
Cybersecurity	Threat detection, Prevention measures, Incident response, Recovery procedures, Governance frameworks
Supply Chain	Risk identification, Mitigation strategies, Resilience measures, Technology solutions, Regulatory compliance

Step 2: Write Your Extraction Prompt

Edit the extraction_prompt field in config.json. The prompt should:

Describe the assistant's role for your domain
List metadata fields (DOI, title, journal, etc. — usually the same across domains)
Define each content category with clear descriptions and examples
Set extraction rules (no hallucination, preserve original text, deduplication)
Specify the output JSON format with exact key names

Here is a template you can adapt:

{
  "extraction_prompt": [
    "You are an academic information extraction assistant specialized in [YOUR DOMAIN].",
    "",
    "From each paper, extract:",
    "1) Bibliographic metadata",
    "2) Domain-specific content, categorized as follows:",
    "",
    "METADATA FIELDS:",
    "- 论文DOI: Full DOI URL",
    "- 题目: Paper title",
    "- 期刊名称: Journal name",
    "- 作者机构: Author affiliations (semicolon-separated)",
    "- 发表日期: Publication date (Month Year)",
    "",
    "CONTENT CATEGORIES:",
    "- [Category1_Key]: [Description of what to extract]",
    "- [Category2_Key]: [Description of what to extract]",
    "- ... (add as many as needed)",
    "",
    "RULES:",
    "- Only extract content explicitly present in the paper",
    "- Preserve original text, do not summarize",
    "- Output as JSON with metadata as strings, categories as arrays of strings"
  ]
}

Step 3: Update the Field Mapping in `pdf_extractor.py`

If you change the Chinese key names in your extraction prompt (e.g., use "预防措施" instead of "检测预警措施"), update the _format_result() and _merge_chunk_results() methods in pdf_extractor.py to map your new keys to the internal field names.

For example, if your domain is public health:

# In _format_result():
final_result = {
    # ... metadata fields stay the same ...
    "prevention_measures": self._join_measures(result_data.get("预防措施", [])),
    "treatment_measures": self._join_measures(result_data.get("治疗措施", [])),
    "surveillance_measures": self._join_measures(result_data.get("监测措施", [])),
    # ... add your categories ...
}

Step 4: Update `server.py` Display (Optional)

If you want the MCP tool output to show your custom category names, update the call_tool() function in server.py where it formats the extraction result display.

Tips for Writing Good Extraction Prompts

Be specific: Provide concrete examples of what belongs in each category
Set boundaries: Clearly state what does NOT belong in each category to avoid overlap
Request detail: Ask for full paragraphs, not just keywords — this prevents information loss
Use the paper's language: Tell the LLM to preserve the original language (Chinese/English)
Test iteratively: Try your prompt on 2-3 papers, review the results, and refine

Configuration Reference

config.json

Field	Type	Description
`papers_dir`	string	Directory containing PDF files (default: `"papers"`)
`output_dir`	string	Output directory (default: `"outputs"`)
`extraction_prompt`	string or string[]	The LLM prompt defining extraction fields and rules
`llm_config.enabled`	bool	Enable/disable LLM extraction
`llm_config.provider`	string	LLM provider (currently `"openai"`)
`llm_config.model`	string	Model name (e.g., `"gpt-4o"`, `"gpt-4-turbo"`)
`llm_config.api_key`	string	Your API key
`llm_config.api_base`	string	API base URL
`llm_config.temperature`	number	Generation temperature (0 = deterministic)

Recommended Models

Model	Speed	Quality	Cost
`gpt-4o`	Fast	High	Medium
`gpt-4-turbo`	Medium	Highest	High
`gpt-3.5-turbo`	Fastest	Good	Low

Any OpenAI-compatible model works (DeepSeek, Qwen, local Ollama, etc.).

Cost Estimate

Using GPT-4o:

Single paper (~10 pages): ~$0.04-0.07
100 papers: ~$4-7

Troubleshooting

Problem	Solution
MCP server not visible in Claude	Check config path, restart Claude Desktop
API call fails	Verify API key, check network, check account balance
Empty extraction	Ensure PDF is text-based (not scanned images)
Incomplete results	Paper may be too long — chunking handles this automatically

See CLAUDE_SETUP.md for a detailed setup and troubleshooting guide.

License

MIT License

If this project helps your research, please give it a star!

Recommended Servers

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official

Featured

TypeScript

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official

Featured

Local

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official

Featured

TypeScript

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official

Featured

Python

E2B

Using MCP to run code via e2b.

Official

Featured

Neon Database

MCP server for interacting with Neon Management API and databases

Official

Featured

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official

Featured

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official

Featured

paper-extraction-MCP

README

Paper Extraction MCP

Features

Quick Start

1. Install Dependencies

2. Configure API Key

3. Add PDF Papers

4. Configure Claude Desktop

5. Use It

MCP Tools

Project Structure

How It Works

Built-in Example: Typhoon Disaster Governance

Adapting to Other Domains

Step 1: Define Your Categories

Step 2: Write Your Extraction Prompt

Step 3: Update the Field Mapping in pdf_extractor.py

Step 4: Update server.py Display (Optional)

Tips for Writing Good Extraction Prompts

Configuration Reference

config.json

Recommended Models

Cost Estimate

Troubleshooting

License

Recommended Servers

Step 3: Update the Field Mapping in `pdf_extractor.py`

Step 4: Update `server.py` Display (Optional)