paper-extraction-MCP
Enables structured information extraction from academic PDFs using LLMs, integrating with Claude Desktop for natural language querying and batch processing.
README
Paper Extraction MCP
An MCP (Model Context Protocol) server for structured information extraction from academic PDF papers using LLM. It integrates seamlessly with Claude Desktop, allowing you to extract metadata and domain-specific content categories from research papers through natural language conversation.
Typhoon disaster governance is provided as a built-in example. The system is fully customizable for any research domain — see Adapting to Other Domains.
Features
- LLM-Powered Extraction — Uses OpenAI-compatible LLMs to extract structured data from full-text PDFs
- Customizable Schema — Define your own metadata fields and content categories via
config.json - Smart Chunking — Automatically splits long papers and merges results with deduplication
- Dual Output — JSON and CSV formats for downstream analysis
- MCP Protocol — Works directly inside Claude Desktop as a tool server
- Batch Processing — Extract from a single paper or all papers at once
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Configure API Key
Copy the template and fill in your API key:
cp config.json.template config.json
Edit config.json:
{
"llm_config": {
"api_key": "sk-your-api-key-here",
"api_base": "https://api.openai.com/v1",
"model": "gpt-4o"
}
}
Any OpenAI-compatible API is supported (OpenAI, Azure OpenAI, local LLMs with OpenAI-compatible endpoints, third-party proxies, etc.).
3. Add PDF Papers
Place your PDF files in the papers/ directory.
4. Configure Claude Desktop
Edit the Claude Desktop config file:
- Windows:
%APPDATA%\Claude\claude_desktop_config.json - macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
Add:
{
"mcpServers": {
"paper-extraction": {
"command": "python",
"args": ["<full-path-to>/paper-extraction-MCP/server.py"],
"cwd": "<full-path-to>/paper-extraction-MCP"
}
}
}
Replace <full-path-to> with your actual path. Then restart Claude Desktop.
5. Use It
In Claude Desktop, simply say:
List the PDF papers available for extraction.
Extract the paper "my_paper.pdf".
Extract all papers and show me a summary.
MCP Tools
| Tool | Description |
|---|---|
list_papers |
List all PDF files in the papers/ directory |
extract_paper |
Extract metadata + categories from a single PDF |
extract_all_papers |
Batch extract all PDFs |
get_extraction_result |
Retrieve a previously extracted result (JSON) |
Project Structure
paper-extraction-MCP/
├── server.py # MCP server (entry point)
├── pdf_extractor.py # Core extraction logic
├── config.json # Your configuration (gitignored)
├── config.json.template # Configuration template
├── requirements.txt # Python dependencies
├── papers/ # Place PDF files here
├── outputs/
│ ├── json/ # JSON extraction results
│ └── csv/ # CSV extraction results
├── setup.bat # Windows quick setup
├── setup.sh # macOS/Linux quick setup
├── CLAUDE_SETUP.md # Detailed Claude Desktop setup guide
├── LICENSE # MIT License
└── README.md
How It Works
PDF file
│
▼
pdfplumber (text extraction)
│
▼
Full text ──► LLM API call ──► Structured JSON
│
config.json defines:
- extraction_prompt (fields & rules)
- llm_config (model, temperature)
│
▼
JSON + CSV output
- Text Extraction:
pdfplumberextracts full text from each PDF page - LLM Extraction: The text is sent to an LLM with your
extraction_prompt, which defines what fields to extract and how - Smart Chunking: If the text exceeds ~100K characters, it is automatically split into chunks, extracted separately, and merged with deduplication
- Output: Results are saved as JSON and CSV
Built-in Example: Typhoon Disaster Governance
The default config.json is pre-configured for extracting information from typhoon disaster governance papers:
Metadata fields (5):
- DOI, Title, Journal, Author Affiliations, Publication Date
Content categories (7):
| Category | Description |
|---|---|
| Detection & Early Warning | Monitoring, forecasting, alert systems |
| Engineering Protection | Seawalls, drainage, building reinforcement |
| Emergency Response | Evacuation, shelters, rescue operations |
| Post-disaster Recovery | Reconstruction, ecological restoration |
| Policy & Management | Regulations, institutional coordination |
| Digital Technology | AI, big data, remote sensing, GIS, IoT |
| Other Measures | Community-based, education, insurance |
Adapting to Other Domains
The core of this tool is domain-agnostic. You only need to modify config.json — no code changes required. Here is a step-by-step guide:
Step 1: Define Your Categories
Decide what information you want to extract. For example:
| Domain | Possible Categories |
|---|---|
| Climate Change Adaptation | Mitigation measures, Adaptation strategies, Carbon reduction technologies, Policy instruments, Financial mechanisms |
| Urban Planning | Land use strategies, Transportation planning, Green infrastructure, Zoning regulations, Community engagement |
| Public Health | Prevention measures, Treatment protocols, Surveillance systems, Policy interventions, Technology applications |
| Cybersecurity | Threat detection, Prevention measures, Incident response, Recovery procedures, Governance frameworks |
| Supply Chain | Risk identification, Mitigation strategies, Resilience measures, Technology solutions, Regulatory compliance |
Step 2: Write Your Extraction Prompt
Edit the extraction_prompt field in config.json. The prompt should:
- Describe the assistant's role for your domain
- List metadata fields (DOI, title, journal, etc. — usually the same across domains)
- Define each content category with clear descriptions and examples
- Set extraction rules (no hallucination, preserve original text, deduplication)
- Specify the output JSON format with exact key names
Here is a template you can adapt:
{
"extraction_prompt": [
"You are an academic information extraction assistant specialized in [YOUR DOMAIN].",
"",
"From each paper, extract:",
"1) Bibliographic metadata",
"2) Domain-specific content, categorized as follows:",
"",
"METADATA FIELDS:",
"- 论文DOI: Full DOI URL",
"- 题目: Paper title",
"- 期刊名称: Journal name",
"- 作者机构: Author affiliations (semicolon-separated)",
"- 发表日期: Publication date (Month Year)",
"",
"CONTENT CATEGORIES:",
"- [Category1_Key]: [Description of what to extract]",
"- [Category2_Key]: [Description of what to extract]",
"- ... (add as many as needed)",
"",
"RULES:",
"- Only extract content explicitly present in the paper",
"- Preserve original text, do not summarize",
"- Output as JSON with metadata as strings, categories as arrays of strings"
]
}
Step 3: Update the Field Mapping in pdf_extractor.py
If you change the Chinese key names in your extraction prompt (e.g., use "预防措施" instead of "检测预警措施"), update the _format_result() and _merge_chunk_results() methods in pdf_extractor.py to map your new keys to the internal field names.
For example, if your domain is public health:
# In _format_result():
final_result = {
# ... metadata fields stay the same ...
"prevention_measures": self._join_measures(result_data.get("预防措施", [])),
"treatment_measures": self._join_measures(result_data.get("治疗措施", [])),
"surveillance_measures": self._join_measures(result_data.get("监测措施", [])),
# ... add your categories ...
}
Step 4: Update server.py Display (Optional)
If you want the MCP tool output to show your custom category names, update the call_tool() function in server.py where it formats the extraction result display.
Tips for Writing Good Extraction Prompts
- Be specific: Provide concrete examples of what belongs in each category
- Set boundaries: Clearly state what does NOT belong in each category to avoid overlap
- Request detail: Ask for full paragraphs, not just keywords — this prevents information loss
- Use the paper's language: Tell the LLM to preserve the original language (Chinese/English)
- Test iteratively: Try your prompt on 2-3 papers, review the results, and refine
Configuration Reference
config.json
| Field | Type | Description |
|---|---|---|
papers_dir |
string | Directory containing PDF files (default: "papers") |
output_dir |
string | Output directory (default: "outputs") |
extraction_prompt |
string or string[] | The LLM prompt defining extraction fields and rules |
llm_config.enabled |
bool | Enable/disable LLM extraction |
llm_config.provider |
string | LLM provider (currently "openai") |
llm_config.model |
string | Model name (e.g., "gpt-4o", "gpt-4-turbo") |
llm_config.api_key |
string | Your API key |
llm_config.api_base |
string | API base URL |
llm_config.temperature |
number | Generation temperature (0 = deterministic) |
Recommended Models
| Model | Speed | Quality | Cost |
|---|---|---|---|
gpt-4o |
Fast | High | Medium |
gpt-4-turbo |
Medium | Highest | High |
gpt-3.5-turbo |
Fastest | Good | Low |
Any OpenAI-compatible model works (DeepSeek, Qwen, local Ollama, etc.).
Cost Estimate
Using GPT-4o:
- Single paper (~10 pages): ~$0.04-0.07
- 100 papers: ~$4-7
Troubleshooting
| Problem | Solution |
|---|---|
| MCP server not visible in Claude | Check config path, restart Claude Desktop |
| API call fails | Verify API key, check network, check account balance |
| Empty extraction | Ensure PDF is text-based (not scanned images) |
| Incomplete results | Paper may be too long — chunking handles this automatically |
See CLAUDE_SETUP.md for a detailed setup and troubleshooting guide.
License
If this project helps your research, please give it a star!
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.