Unstructured API MCP Server for Research Paper Data Processing

GitHub repository for Unstructured MCP Hackathon.

HeetVekariya

This server uses the Unstructured API to expose a set of tools that extract structured information from research papers. The extracted data can then be used to fine-tune a large language model (LLM), reducing the time researchers spend on literature review.

Check out the Blog here:

Table of Contents:

  1. Setup
  2. Requirements
  3. Project Flow
  4. Available Tools
  5. Follow Along
  6. Claude Desktop Integration
  7. Debugging Tools
  8. Running a minimal client locally against the server

Setup

Install dependencies:

  • uv add "mcp[cli]"
  • uv pip install --upgrade unstructured-client python-dotenv

or use uv sync.

Requirements

Before you begin working with the UNS_MCP project, make sure you have the following set up:

  1. UNSTRUCTURED_API_KEY

  2. GOOGLEDRIVE_SERVICE_ACCOUNT_KEY

    • Set up a Google Cloud project and create a service account so the server can read PDFs from Google Drive. Check the setup process here.
    • Save the JSON credentials for your service account and use them to set GOOGLEDRIVE_SERVICE_ACCOUNT_KEY.

  3. MONGO_DB_CONNECTION_STRING

    • Set up a (cloud) MongoDB database and get its connection string. Check out the setup process here.

  4. .env.template

    • The .env.template file lists all the required environment variables. Copy this file to .env and set the values for the keys mentioned above.

    Example .env file:

    UNSTRUCTURED_API_KEY="<key-here>"
    MONGO_DB_CONNECTION_STRING="<CONNECTION_STRING>"
    GOOGLEDRIVE_SERVICE_ACCOUNT_KEY="<converted string>"
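
One way to catch a missing key before starting the server is a small check like the one below. It is a sketch, not code from the repository: the helper name `missing_env_vars` is hypothetical, and only the three variable names come from the example above.

```python
import os
from typing import Mapping

# Variable names taken from the example .env file above.
REQUIRED_VARS = (
    "UNSTRUCTURED_API_KEY",
    "MONGO_DB_CONNECTION_STRING",
    "GOOGLEDRIVE_SERVICE_ACCOUNT_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return required variables that are unset, empty, or still placeholders."""
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip().strip('"')
        # Treat template placeholders such as "<key-here>" as missing.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# With python-dotenv installed, calling load_dotenv() first copies the
# values from .env into os.environ, so missing_env_vars() sees them.
```

An empty list means all three values are present; anything else names the keys that still need attention.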
    
    
    

Project Flow

  1. User Query to MCP Client

  2. Claude Interacts with UNS_MCP Server

    • Claude forwards the user's query to the custom MCP server named UNS_MCP.
  3. MCP Tool Executes Unstructured API

    • UNS_MCP interacts with the Unstructured API to process the research paper PDF, extract relevant information, and convert it into structured JSON data.
  4. Structured Data (JSON) Output Is Stored in the Destination

    • The Unstructured API returns its result as JSON, which is written to the destination connector (MongoDB in this project). This data can then be used to fine-tune LLMs, helping researchers find relevant information without reading the entire paper.
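
To illustrate what the structured JSON can look like downstream, the sketch below groups extracted text by page. It assumes each element is a dictionary with `type`, `text`, and `metadata.page_number` keys, which is how Unstructured typically represents document elements. The sample data and the helper name `text_by_page` are invented for illustration.

```python
from collections import defaultdict

# Invented sample data; real elements would be read from the destination.
sample_elements = [
    {"type": "Title", "text": "A Sample Paper Title", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "We study a toy problem.", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Results are reported in Table 1.", "metadata": {"page_number": 2}},
]


def text_by_page(elements):
    """Join the text of all non-empty elements, keyed by page number."""
    pages = defaultdict(list)
    for element in elements:
        text = element.get("text", "").strip()
        if not text:
            continue  # skip elements that carry no text
        page = element.get("metadata", {}).get("page_number")
        pages[page].append(text)
    return {page: "\n".join(parts) for page, parts in pages.items()}


pages = text_by_page(sample_elements)
```

Per-page text like this is one possible starting point for building fine-tuning records; the right grouping depends on the chunking strategy chosen in the workflow.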

Available Tools

| Tool | Description |
| --- | --- |
| list_sources | Lists available sources from the Unstructured API. |
| get_source_info | Get detailed information about a specific source connector. |
| create_gdrive_source | Create a Google Drive source connector. |
| update_gdrive_source | Update an existing Google Drive source connector by params. |
| delete_gdrive_source | Delete a source connector by source id. |
| list_destinations | Lists available destinations from the Unstructured API. |
| get_destination_info | Get detailed info about a specific destination connector. Currently, we have s3/weaviate/astra/neo4j/mongo DB (more to come!) |
| create_mongodb_destination | Create a MongoDB destination connector by params. |
| update_mongodb_destination | Update an existing MongoDB destination connector by destination id. |
| delete_mongodb_destination | Delete a MongoDB destination connector by destination id. |
| list_workflows | Lists workflows from the Unstructured API. |
| get_workflow_info | Get detailed information about a specific workflow. |
| create_workflow | Create a new workflow with source, destination id, etc. |
| run_workflow | Run a specific workflow by workflow id. |
| update_workflow | Update an existing workflow by params. |
| delete_workflow | Delete a specific workflow by id. |
| list_jobs | Lists jobs for a specific workflow from the Unstructured API. |
| get_job_info | Get detailed information about a specific job by job id. |
| cancel_job | Cancel a specific job by id. |
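
Under the hood, an MCP client invokes any of these tools with a JSON-RPC `tools/call` request. The sketch below builds such a message by hand to show its shape; in practice Claude Desktop or an MCP client library does this for you. The argument name `workflow_id` and the helper `tool_call_request` are assumptions made for illustration, so check each tool's input schema for the real parameter names.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 message that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "workflow_id" is a hypothetical argument name; the value is a placeholder.
request = tool_call_request(1, "run_workflow", {"workflow_id": "<workflow-id>"})
print(json.dumps(request, indent=2))
```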

Follow Along

1. Set Up Required Connectors

Google Drive Source Connector:

  • Create a Google Drive Source Connector to connect your service account with Google Drive and retrieve PDFs.
  • Test the connection to ensure accessibility.

MongoDB Destination Connector:

  • Set up the MongoDB Destination Connector to store processed data.
  • Test the connection to ensure accessibility.

2. Develop the Workflow

  1. Define Connectors: Set up the Google Drive source and MongoDB destination connectors.

  2. Partitioning: Use Auto partitioning for optimal document splitting.

  3. Chunking: Apply by-page chunking for manageable text segments.

  4. Enrichment: Use NER to extract entities and table enrichment for any tables.

  5. Embedding: Convert text into embeddings for querying or analysis.

Note: You can adjust any step of the flow (partitioning, chunking, enrichment, embedding) as needed.
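
The ordering of these stages can be written down as plain data before you ask Claude to create the workflow. The sketch below is not the Unstructured API's workflow schema: the stage names, the `option` field, and the `check_stage_order` helper are hypothetical and only mirror the steps listed above.

```python
# Expected relative order of stages, mirroring the list above.
STAGE_ORDER = ("partition", "chunk", "enrich", "embed")

# Hypothetical description of the flow; option values are illustrative labels.
workflow_plan = [
    {"stage": "partition", "option": "auto"},
    {"stage": "chunk", "option": "by_page"},
    {"stage": "enrich", "option": "ner"},
    {"stage": "enrich", "option": "table"},
    {"stage": "embed", "option": "default"},
]


def check_stage_order(plan):
    """Return True if stages never move backwards relative to STAGE_ORDER.

    Stages may be repeated or omitted; an unknown stage name raises ValueError.
    """
    positions = [STAGE_ORDER.index(step["stage"]) for step in plan]
    return positions == sorted(positions)
```

Keeping the plan as data makes it easy to drop or swap a stage while still checking that, for example, embedding is never placed before chunking.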


3. Set Up Claude Desktop

  1. Install Claude Desktop and integrate it with the UNS_MCP server by following the steps in the Claude Desktop Integration section.
  2. Restart Claude Desktop so that it connects to the MCP server and the tools become available.

4. Query and Run the Workflow

  • Use Claude to list, create, edit, delete, and run workflows. See Available Tools above for the full set of operations.

5. Results

Claude Desktop Integration

To install in Claude Desktop:

  1. Open claude_desktop_config.json by running the command below.

# For macOS:
code ~/Library/Application\ Support/Claude/claude_desktop_config.json

# For Windows (PowerShell):
code $env:AppData\Claude\claude_desktop_config.json

  2. In that file add:

{
    "mcpServers":
    {
        "UNS_MCP":
        {
            "command": "ABSOLUTE/PATH/TO/.local/bin/uv",
            "args":
            [
                "--directory",
                "ABSOLUTE/PATH/TO/YOUR-UNS-MCP-REPO/uns_mcp",
                "run",
                "server.py"
            ],
            "env":
            {
                "UNSTRUCTURED_API_KEY": "<your key>"
            },
            "disabled": false
        }
    }
}

  3. Restart Claude Desktop.

  4. Example issues seen from Claude Desktop.

    • If you see "No destinations found" when you query for a list of destination connectors, check the API key in .env or in your config JSON. It needs to be your personal key from https://platform.unstructured.io/app/account/api-keys.
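
Because a malformed claude_desktop_config.json is easy to miss, it can help to build the entry programmatically and let the JSON serializer guarantee validity. A minimal sketch follows; the helper name `uns_mcp_entry` is hypothetical, and the paths and key are placeholders that you must replace with your own values.

```python
import json


def uns_mcp_entry(uv_path, repo_dir, api_key):
    """Return the mcpServers entry for UNS_MCP as a Python dictionary."""
    return {
        "mcpServers": {
            "UNS_MCP": {
                "command": uv_path,
                "args": ["--directory", repo_dir, "run", "server.py"],
                # "env" must be a JSON object (key/value pairs), not an array.
                "env": {"UNSTRUCTURED_API_KEY": api_key},
                "disabled": False,
            }
        }
    }


config = uns_mcp_entry(
    "ABSOLUTE/PATH/TO/.local/bin/uv",
    "ABSOLUTE/PATH/TO/YOUR-UNS-MCP-REPO/uns_mcp",
    "<your key>",
)
print(json.dumps(config, indent=4))
```

If the config file already lists other servers, merge the "UNS_MCP" entry into the existing "mcpServers" object rather than overwriting the file.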

Debugging Tools

Anthropic provides the MCP Inspector tool to debug and test your MCP server. Run the following command to spin up a debugging UI. From there, you can add environment variables (pointing to your local env) in the left pane; include your personal API key there as an environment variable. Under Tools, you can test the capabilities you add to the MCP server.

mcp dev uns_mcp/server.py

If you need to log request call parameters to UnstructuredClient, set the environment variable DEBUG_API_REQUESTS=true. The logs are stored in a file named unstructured-client-{date}.log, which you can examine to debug the parameters passed to UnstructuredClient functions.
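
When several of these log files accumulate, a short helper can pick the most recent one. The sketch below is an assumption-laden convenience, not part of the server: it assumes the logs are written to the directory you pass in (the README does not say where they land), and the helper name `latest_client_log` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_client_log(directory: str = ".") -> Optional[Path]:
    """Return the most recently modified unstructured-client-*.log file, if any."""
    logs = sorted(
        Path(directory).glob("unstructured-client-*.log"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    return logs[-1] if logs else None
```

Calling `latest_client_log()` from the directory where the logs are written returns the newest file, or `None` if request logging has not produced one yet.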

Running a minimal client locally, accessing the local MCP server over HTTP + SSE

The main difference here is that the client and server are decoupled, which makes it easier to set breakpoints on the server side during development.

# in one terminal, run the server:
uv run python uns_mcp/server.py --host 127.0.0.1 --port 8080
# or
make sse-server

# in another terminal, run the client:
uv run python minimal_client/client.py "http://127.0.0.1:8080/sse"
# or
make sse-client

Hint: ctrl+c out of the client first, then the server. Otherwise the server appears to hang.
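
For orientation, a client along these lines could be as small as the sketch below. It is not the repository's minimal_client/client.py; it assumes the `mcp` Python SDK's `sse_client` and `ClientSession` helpers, and the function names `sse_url` and `list_tool_names` are hypothetical. The SDK import is deferred so that the URL helper works even without the package installed.

```python
import asyncio


def sse_url(host: str, port: int) -> str:
    """Build the SSE endpoint URL used in the commands above."""
    return f"http://{host}:{port}/sse"


async def list_tool_names(url: str) -> list[str]:
    """Connect to a running MCP server over SSE and return its tool names."""
    # Imported lazily: requires the `mcp` package (see Setup).
    from mcp import ClientSession
    from mcp.client.sse import sse_client

    async with sse_client(url) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            result = await session.list_tools()
            return [tool.name for tool in result.tools]


# Usage, with the server from the first terminal already running:
# print(asyncio.run(list_tool_names(sse_url("127.0.0.1", 8080))))
```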
