PNDA-MCP
Enables AI agents to search and retrieve metadata and data files from Peru's National Open Data Platform, and generate Jupyter notebooks for data analysis.
<div align="center">
PNDA-MCP
Model Context Protocol (MCP) Server for PNDA - National Open Data Platform / Plataforma Nacional de Datos Abiertos (Peru)
👨‍💻 Author
Ivan Yang Rodriguez Carranza
</div>
📋 Table of Contents
- 🎯 Overview
- 🎬 Demo
- 🔧 Tools
- 💬 Prompts
- 🚀 How to Use
- 💡 Examples
- 🏛️ Architecture Diagram
- ⚙️ ETL Pipeline
- 📝 License
🎯 Overview
PNDA-MCP is a Model Context Protocol (MCP) server for Peru's National Open Data Platform (Plataforma Nacional de Datos Abiertos). Although Peru's open data platform, datosabiertos.gob.pe, hosts valuable datasets, it can be challenging for AI agents to find and retrieve the most relevant data for a specific data analysis question. PNDA-MCP simplifies this by providing tools and prompts that let AI agents or any MCP client (such as VS Code or Claude Desktop) easily search for and access dataset metadata and the associated data files. The goal is to enable data scientist agents or code agents to automatically discover and analyze public datasets.
This repository includes the ETL pipeline used to extract, transform, and index dataset titles (see etl folder).
🎬 Demo
<div align="center">
https://github.com/user-attachments/assets/57ca9df2-5d71-4eb3-b868-8dbc6833e7c1
</div>
Demo (Spanish): https://youtu.be/dybtNQP33Sk?si=iA3-iWpn3oRJ9fta
🔧 Tools
| Name | Input | Description |
|---|---|---|
| `dataset_search` | `query`, `top_k` | Search for relevant datasets from the PNDA (Plataforma Nacional de Datos Abiertos) Peru. `query` is the search text; `top_k` limits the number of results returned (max 25). |
| `dataset_details` | `id` | Get dataset details including title, metadata, and resources. Returns complete resource information: direct download URLs, file names, sizes, creation dates, MIME types, formats, states, and descriptions. |
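As a rough illustration of how a client might call `dataset_search`, the sketch below builds the JSON-RPC 2.0 `tools/call` request that MCP clients send. The helper names, the default `top_k` of 5, and the choice to clamp `top_k` on the client side are assumptions made for this example; how the server itself treats values above 25 is not documented here.

```python
import json

MAX_TOP_K = 25  # upper bound stated in the tools table


def build_tool_call(name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request (the MCP tool invocation method)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def dataset_search_request(query, top_k=5, request_id=1):
    """Request for `dataset_search`; top_k is clamped to 1..25 (client-side assumption)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    clamped = max(1, min(int(top_k), MAX_TOP_K))
    arguments = {"query": query, "top_k": clamped}
    return build_tool_call("dataset_search", arguments, request_id)


if __name__ == "__main__":
    # Hypothetical query; prints the request body a client would send.
    print(json.dumps(dataset_search_request("accidentes mineros", top_k=40), indent=2))
```

In practice an MCP client library builds this envelope for you; the sketch only shows which fields carry the tool name and its inputs.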
💬 Prompts
| Name | Input | Description |
|---|---|---|
| `question_generation` | `topic` | Generate 5 data analysis questions for any topic using available PNDA datasets. |
| `analysis_quick` | `question` | Create a minimal Jupyter notebook with quick data analysis addressing a question. |
| `analysis_full` | `question` | Create a complete Jupyter notebook with detailed data exploration and analysis addressing a question. |
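Prompts are requested through the MCP `prompts/get` method, with one argument per prompt as listed in the table. The following is a minimal sketch of that request; the `PROMPT_ARGUMENTS` mapping and the `prompt_get_request` helper are hypothetical names introduced only for this example.

```python
# Prompt name -> name of its single input, taken from the prompts table.
PROMPT_ARGUMENTS = {
    "question_generation": "topic",
    "analysis_quick": "question",
    "analysis_full": "question",
}


def prompt_get_request(name, value, request_id=1):
    """Build a JSON-RPC 2.0 `prompts/get` request for one of the PNDA-MCP prompts."""
    if name not in PROMPT_ARGUMENTS:
        raise KeyError(f"unknown prompt: {name}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "prompts/get",
        "params": {"name": name, "arguments": {PROMPT_ARGUMENTS[name]: value}},
    }
```

For instance, `prompt_get_request("question_generation", "Mining")` asks the server for the question-generation prompt with `topic` set to "Mining".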
🚀 How to Use
VS Code (Remote Server)
Note: Requires `npx`, which comes bundled with npm. If you don't have npm installed, install Node.js, which includes npm.
The fastest and easiest way to try this MCP is to use the 1-click installation button:
Note: If the MCP tools and prompts do not load immediately, please try restarting VS Code.
Manual installation:
- Open the Command Palette: `View > Command Palette` (or `Cmd+Shift+P` on Mac / `Ctrl+Shift+P` on Windows/Linux)
- Type and select: `MCP: Add Server...`
- Choose "Command (stdio)" as the server type
- For "Command to run (with optional arguments)", enter: `npx mcp-remote https://pnda-mcp.onrender.com/mcp`
- Set the name for the MCP server: `pnda-mcp`
- Select where to save the configuration: User Settings saves the config globally for all projects. Workspace Settings saves it locally for just the current one.
- Save the configuration
- Restart VS Code for the MCP server to become available.
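The steps above end up writing a server entry into VS Code's MCP configuration. The sketch below generates an equivalent entry from the command string. The `servers` / `type` / `command` / `args` layout reflects how VS Code's `mcp.json` is commonly structured and should be treated as an assumption to check against your VS Code version; the helper name is hypothetical.

```python
import json
import shlex


def stdio_server_config(name, command_line):
    """Turn a 'command to run' string into a stdio server entry keyed by server name."""
    command, *args = shlex.split(command_line)
    return {"servers": {name: {"type": "stdio", "command": command, "args": args}}}


if __name__ == "__main__":
    config = stdio_server_config(
        "pnda-mcp", "npx mcp-remote https://pnda-mcp.onrender.com/mcp"
    )
    print(json.dumps(config, indent=2))
```

The same helper works for a local server by passing `uv --directory /path/to/pnda-mcp run main.py` as the command line.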
VS Code (Local Server)
Important: Before running the MCP server locally, you need to:
- Have an OpenAI API key. Get your OpenAI API key from platform.openai.com.
- Have a Pinecone account. If you don't have an account, you can sign up at pinecone.io.
- Configure your OpenAI API key and Pinecone API key in the `.env` configuration file.
- Run the ETL pipeline to index the dataset metadata from PNDA into Pinecone (see the ETL Pipeline section below).
- Open the Command Palette: `View > Command Palette` (or `Cmd+Shift+P` on Mac / `Ctrl+Shift+P` on Windows/Linux)
- Type and select: `MCP: Add Server...`
- Choose "Command (stdio)" as the server type
- For "Command to run (with optional arguments)", enter: `uv --directory /path/to/pnda-mcp run main.py`
  Note: Replace `/path/to/pnda-mcp` with the actual path where you cloned the repository.
- Set the name for the MCP server: `pnda-mcp`
- Select where to save the configuration: User Settings saves the config globally for all projects. Workspace Settings saves it locally for just the current one.
- Save the configuration
- Restart VS Code for the MCP server to become available.
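Since the local server depends on both API keys being present in `.env`, a quick pre-flight check can save a failed start. This is a minimal sketch using only the standard library; it assumes simple `KEY=value` lines (no `export` prefix or multi-line values), and the function names are hypothetical rather than part of the project.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or set to an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    print("missing keys:", missing_keys(text) or "none")
```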
MCP Inspector (Alternative)
Important: Before running the MCP server locally, you need to:
- Have an OpenAI API key. Get your OpenAI API key from platform.openai.com.
- Have a Pinecone account. If you don't have an account, you can sign up at pinecone.io.
- Configure your OpenAI API key and Pinecone API key in the `.env` configuration file.
- Run the ETL pipeline to index the dataset metadata from PNDA into Pinecone (see the ETL Pipeline section below).
Note: Requires `npx`, which comes bundled with npm. If you don't have npm installed, install Node.js, which includes npm.
Note: Replace `/path/to/pnda-mcp` with the actual path where you cloned the repository.
Run:
npx @modelcontextprotocol/inspector \
uv \
--directory /path/to/pnda-mcp \
run \
main.py
Open MCP Inspector (URL displayed in the console) and configure the MCP client with the following settings:
- Transport Type: STDIO
- Command: `python`
- Arguments: `main.py`
💡 Examples
| Prompt | Input | Demo | Notebook | Language |
|---|---|---|---|---|
| `question_generation` | Mining | View Demo | - | English |
| `analysis_quick` | How has student enrollment at the National University of Engineering evolved between 2017 and 2023 by faculties and degree programs? | View Demo | View Notebook | English |
| `analysis_full` | What types of fatal accidents are most frequent in the Peruvian mining industry, and in which departments do they occur most often? | View Demo | View Notebook | English |
| `question_generation` | Minería | View Demo | - | Spanish |
| `analysis_quick` | ¿Cómo ha evolucionado la matrícula de estudiantes en la Universidad Nacional de Ingeniería entre 2017 y 2023 por facultades y carreras? | View Demo | View Notebook | Spanish |
| `analysis_full` | ¿Qué tipos de accidentes mortales son más frecuentes en la industria minera peruana y en qué departamentos ocurren con mayor frecuencia? | View Demo | View Notebook | Spanish |
🏛️ Architecture Diagram
PNDA-MCP follows the Model Context Protocol specification and provides a clean abstraction layer for PNDA.
graph LR
CLIENT[MCP Client<br/>VS Code, Cursor, etc.] --> MCP_SERVER[PNDA-MCP Server]
subgraph TOOLS ["🔧 Tools"]
DATASET_SEARCH[dataset_search]
DATASET_DETAILS[dataset_details]
end
subgraph "💬 Prompts"
QUESTION_GEN[question_generation]
ANALYSIS_QUICK[analysis_quick]
ANALYSIS_FULL[analysis_full]
end
MCP_SERVER --> DATASET_SEARCH
MCP_SERVER --> DATASET_DETAILS
MCP_SERVER --> QUESTION_GEN
MCP_SERVER --> ANALYSIS_QUICK
MCP_SERVER --> ANALYSIS_FULL
DATASET_SEARCH -->|semantic search| PINECONE[Pinecone Vector Database]
DATASET_SEARCH --> OPENAI[OpenAI Text Embeddings API]
DATASET_DETAILS --> CACHE[Cache Layer]
CACHE --> |fallback source| PNDA_API[PNDA API]
CACHE --> |secondary fallback| PINECONE
style CLIENT fill:#e3f2fd
style MCP_SERVER fill:#f3e5f5
style PNDA_API fill:#fff3e0
style PINECONE fill:#fff3e0
style OPENAI fill:#fff3e0
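The `dataset_details` path in the diagram (cache first, then the PNDA API, then Pinecone as a secondary fallback) can be sketched as a small resolver. This is an illustration of the fallback pattern only, not the project's actual code: the class name and the two injected fetch callables are hypothetical, and the real cache may have expiry or persistence that this in-memory dict lacks.

```python
class DetailsResolver:
    """Resolve dataset details via cache, then a primary and a secondary source."""

    def __init__(self, fetch_from_api, fetch_from_index):
        self._cache = {}
        # Order matters: the PNDA API is tried before the vector index.
        self._sources = (fetch_from_api, fetch_from_index)

    def get(self, dataset_id):
        if dataset_id in self._cache:
            return self._cache[dataset_id]
        for source in self._sources:
            try:
                details = source(dataset_id)
            except Exception:
                continue  # source unavailable, try the next one
            if details is not None:
                self._cache[dataset_id] = details
                return details
        raise LookupError(f"no details found for dataset {dataset_id!r}")
```

Passing the sources in as callables keeps the sketch testable without network access; a real implementation would wrap the PNDA API client and the Pinecone metadata lookup.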
⚙️ ETL Pipeline
Important: The following ETL documentation is only needed if you want to run the MCP locally or deploy your own MCP service. You can use the remote MCP service without running the ETL.
To search datasets using natural language, semantic search with text vector embeddings is used. The ETL pipeline handles the initial indexing and ongoing synchronization of the vector database containing dataset metadata from Peru's National Open Data Platform. It can be run manually or automatically via cron jobs to ensure the dataset information stays up to date.
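To make the idea of semantic search concrete, the sketch below ranks titles by cosine similarity between a query vector and stored title vectors. The three-dimensional vectors are made up for illustration and are not real embeddings; in the actual pipeline the vectors come from the OpenAI Text Embeddings API and the ranking is done by Pinecone, whose configured similarity metric is assumed here to be cosine.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_titles(query_vec, index, k=3):
    """Return the k titles whose vectors are most similar to the query vector."""
    scored = [(cosine(query_vec, vec), title) for title, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]


if __name__ == "__main__":
    # Toy vectors standing in for embeddings of dataset titles.
    toy_index = {
        "mining accidents": [1.0, 0.0, 0.0],
        "university enrollment": [0.0, 1.0, 0.0],
        "mining production": [0.9, 0.1, 0.0],
    }
    print(top_k_titles([1.0, 0.0, 0.0], toy_index, k=2))
```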
Requirements
- Docker & Redis: Docker runs a local Redis server, which serves as the message broker and result backend that coordinates tasks with Celery workers during ETL pipeline execution.
- OpenAI API key: The OpenAI Text Embeddings API converts dataset titles into vectors using the `text-embedding-3-small` model. Get your OpenAI API key from platform.openai.com.
- Pinecone account: Dataset titles are indexed in the Pinecone cloud vector database for semantic search. If you don't have an account, you can sign up at pinecone.io.
Setup and Usage
Note: Make sure you have `uv` installed. If not, install it from uv.tool.
- Clone and install:
  `git clone https://github.com/rodcar/pnda-mcp.git`
  `cd pnda-mcp`
  `uv sync`
- Create the `.env` file:
  macOS/Linux: `cp .env.example .env`
  Windows: `copy .env.example .env`
- Set your `OPENAI_API_KEY` and `PINECONE_API_KEY` values in the `.env` file.
  Note: Get your OpenAI API key from platform.openai.com and your Pinecone API key from app.pinecone.io.
- Run Redis with Docker: `docker run -d -p 6379:6379 redis`
  Note: Celery also supports other broker and backend options. See the Celery documentation for more details.
- Start the Celery worker:
  macOS/Linux: `./etl/celery_worker.sh`
  Windows: `uv run celery -A etl.tasks.app worker --loglevel=info`
  Note: The Celery worker processes ETL tasks asynchronously. Keep this terminal window open; task execution logs appear here when the pipeline runs.
- Run the ETL pipeline
  The pipeline can be executed manually (on demand) or automated with a cron job for daily execution. It is recommended to perform the initial indexing manually, then use the cron job to keep the data synchronized.
  Manual execution: `python -m etl.pipeline`
  Note: The execution might take several minutes. You can see the logs in the `etl/logs/etl.log` file, and the output files of intermediate ETL tasks in the `etl/results` folder.
  Note: You can remove all pending tasks from the Celery task queue with the following command: `celery -A etl.tasks.app purge -f`
  Scheduled with a cron job (macOS/Linux):
  a. Make the script executable: `chmod +x /path/to/pnda-mcp/etl/cron.sh`
     Note: Replace `/path/to/pnda-mcp/etl/cron.sh` with the actual path to the `cron.sh` file.
  b. Edit the crontab: `crontab -e`
  c. Add this line (runs daily at 2 AM): `0 2 * * * /path/to/pnda-mcp/etl/cron.sh`
     Note: Replace `/path/to/pnda-mcp/etl/cron.sh` with the actual path to the `cron.sh` file.
     Note: If you are using vim, press `i` to enter insert mode and paste the cron job; press `Esc` to return to normal mode. Use `:wq` to save and exit.
     Note: To change the hour, replace the 2 (which means 2 AM) with your desired hour in 24-hour format (e.g., 14 for 2 PM).
  d. Verify the cron job was added to the crontab: `crontab -l`
  The pipeline will execute daily at the time specified in the crontab configuration.
  Note: You can see the logs in the `etl/logs/etl.log` file, and the output files of intermediate ETL tasks in the `etl/results` folder.
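The crontab entry used for scheduling follows the standard five-field format (minute, hour, day of month, month, day of week), so changing the run time only means changing the hour field. As a small sketch, the hypothetical helper below builds a daily entry and rejects out-of-range values; it is not part of the repository.

```python
def daily_cron_line(hour, script_path, minute=0):
    """Build a crontab line that runs script_path every day at hour:minute (24-hour clock)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Fields: minute hour day-of-month month day-of-week command
    return f"{minute} {hour} * * * {script_path}"


if __name__ == "__main__":
    # Same schedule as the default in the steps above, moved to 2 PM.
    print(daily_cron_line(14, "/path/to/pnda-mcp/etl/cron.sh"))
```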
ETL Diagram
The following diagram shows the three-stage ETL pipeline that processes dataset metadata from Peru's National Open Data Platform.
flowchart LR
subgraph EXTRACT_WRAPPER["<b>Extract</b>"]
EXTRACT["Fetch complete dataset list from PNDA API"] --> PNDA_API["For each dataset, fetch metadata from PNDA API"]
end
subgraph TRANSFORM_WRAPPER["<b>Transform</b>"]
FILTER["Filter active datasets"] --> STRUCTURE["Format dataset metadata for indexing"]
end
subgraph LOAD_WRAPPER["<b>Load</b>"]
FILTER_CHANGED["Filter datasets with changes*"] --> EMBEDDINGS["Generate embeddings using OpenAI Text Embeddings API"] --> UPSERT["Upsert embeddings to the vector database (Pinecone)"]
end
EXTRACT_WRAPPER e1@==> TRANSFORM_WRAPPER
TRANSFORM_WRAPPER e2@==> LOAD_WRAPPER
e1@{ animate: true }
e2@{ animate: true }
style EXTRACT fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style PNDA_API fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style FILTER fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style STRUCTURE fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style FILTER_CHANGED fill:#fff3e0,stroke:#f57c00,stroke-width:2px
style EMBEDDINGS fill:#fff3e0,stroke:#f57c00,stroke-width:2px
style UPSERT fill:#fff3e0,stroke:#f57c00,stroke-width:2px
*Filters datasets where metadata_modified has changed since the last local version (etl/results/processing_results.json). This means the metadata must be updated in the vector database.
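The change filter described in the footnote can be sketched as a comparison of `metadata_modified` values between the freshly extracted datasets and the last local snapshot. The layout of `etl/results/processing_results.json` is not described here, so the sketch assumes it can be reduced to a mapping from dataset id to the last seen `metadata_modified` value; the `id` field name and the function name are likewise assumptions.

```python
def filter_changed(current, previous):
    """Return datasets that are new or whose metadata_modified differs from the snapshot.

    current:  list of dicts with "id" and "metadata_modified" keys (assumed shape)
    previous: dict mapping dataset id -> metadata_modified seen in the last run
    """
    changed = []
    for dataset in current:
        # A dataset missing from the snapshot yields None and counts as changed.
        if previous.get(dataset["id"]) != dataset["metadata_modified"]:
            changed.append(dataset)
    return changed


if __name__ == "__main__":
    snapshot = {"a": "2024-01-01T00:00:00", "b": "2024-01-02T00:00:00"}
    fetched = [
        {"id": "a", "metadata_modified": "2024-01-01T00:00:00"},  # unchanged
        {"id": "b", "metadata_modified": "2024-03-05T10:00:00"},  # modified
        {"id": "c", "metadata_modified": "2024-03-06T08:00:00"},  # new
    ]
    print([d["id"] for d in filter_changed(fetched, snapshot)])
```

Only the datasets returned by this step need new embeddings, which keeps repeated runs from re-embedding titles that have not changed.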
📝 License
This project is licensed under the Apache License 2.0.