ibge-microdata-mcp

ibge-microdata-mcp

Local-first MCP server for discovering, downloading, and analyzing official IBGE microdata using DuckDB and Parquet.

Category
Visit Server

README

IBGE Microdata MCP Server

Local-first MCP server for discovering, downloading, unpacking, converting, querying, and analyzing official public IBGE microdata.

This project does not host IBGE datasets. It uses IBGE download servers as the source of truth, downloads only files explicitly requested by the user, mirrors them into a local cache, and runs analysis on local files.

IBGE's raw public download host is ftp.ibge.gov.br. The server accepts both https:// and http:// URLs for that host. It tries to keep official HTTPS URLs where possible, but can fall back to official HTTP for public files when HTTPS is slow or unavailable. Tool output reports the resolved URL, transport used, and SHA-256 hash after download.

Why Local-First

IBGE microdata files are public, but many are large enough that an MCP server should not return them directly in chat responses. The practical workflow is:

discover official files -> inspect size -> download selected archive -> inspect/extract entries -> convert selected variables to Parquet -> query with DuckDB

In plain terms:

  • DuckDB is a local analytical SQL engine. It can query large local files without running a database server.
  • Parquet is a compressed columnar file format. Convert fixed-width TXT microdata once, then query only the columns you need.

Core Generic Tools

Tool Purpose
ibge_microdata_list_surveys List survey families with convenience support.
ibge_microdata_connectivity_check Check whether this machine can reach IBGE download/API endpoints over HTTPS and HTTP.
ibge_microdata_list_files List known public archive files for supported survey families.
ibge_microdata_list_directory List any official ftp.ibge.gov.br directory.
ibge_microdata_discover Bounded crawl of official IBGE directories to find microdata, data, documentation, and layout files.
ibge_microdata_discover_metadata Bounded crawl focused on likely dictionary, layout, input, codebook, and documentation files.
ibge_microdata_file_info Read file size, type, update timestamp, and validators with HTTP HEAD.
ibge_microdata_download_file Download or reuse one official IBGE file in a local cache.
ibge_microdata_list_cache List files already downloaded into a local cache with URLs, paths, sizes, and timestamps.
ibge_microdata_cleanup_cache Preview or delete selected cached files using safe filters.
ibge_microdata_zip_entries List files inside a local ZIP archive without extracting all of it.
ibge_microdata_extract_zip_entry Extract one selected ZIP entry to a local path.
ibge_microdata_inspect_layout Parse a local IBGE fixed-width input layout and search variables.
ibge_microdata_metadata_inventory Parse local dictionaries/layouts or documentation ZIPs into records, variables, positions, types, and value labels.
ibge_microdata_search_variables Search local official metadata for variable names, descriptions, and parsed value labels/categories.
ibge_microdata_export_architecture_csv Export parsed official metadata to a local Base dos Dados-style architecture CSV.
ibge_microdata_export_dictionary_csv Export parsed official value labels/categories to a local Base dos Dados-style dicionario.csv.
ibge_microdata_fixed_width_file_to_parquet Convert a fixed-width TXT file plus official layout into a local Parquet file.
ibge_microdata_fixed_width_zip_to_parquet Convert one fixed-width TXT entry inside a ZIP directly into local Parquet.
ibge_microdata_query_parquet Run bounded read-only DuckDB SQL over local Parquet files exposed as microdata.
ibge_microdata_query_parquet_views Run bounded read-only DuckDB SQL over multiple named Parquet views for joins.
ibge_microdata_weighted_distribution Calculate weighted totals, means, group shares, and top-bracket shares over local Parquet views.
ibge_microdata_describe_parquet_views Inspect schemas, row counts, and sample rows for named Parquet views.
ibge_microdata_profile_parquet_views Profile local Parquet views with row counts, null counts, numeric ranges, frequent values, and samples.
ibge_microdata_validate_recipe Validate a versioned JSON harmonization recipe without writing output.
ibge_microdata_apply_recipe Apply a versioned JSON harmonization recipe and write a derived Parquet file.

The generic path is discovery, caching, official metadata inspection/search, Parquet conversion, profiling, and DuckDB querying. These tools are the main public surface of the server.

Optional Survey-Specific Helpers

These helpers are layered on top of the same local-first workflow. They are useful shortcuts for known IBGE formats, but they are not required for the generic workflow.

Tool Purpose
ibge_microdata_pof_manifest Parse a POF Excel dictionary and map record sheets to data ZIP entries.
ibge_microdata_pof_zip_record_to_parquet Convert one POF record from a Dados ZIP to Parquet using the POF dictionary.
ibge_microdata_pnadc_analyze_file PNAD Contínua convenience summary over an extracted fixed-width TXT file.
ibge_microdata_pnadc_analyze_zip PNAD Contínua convenience summary directly over a TXT entry inside a ZIP.
ibge_microdata_r_status Check local Rscript and required R packages.
ibge_microdata_pnadc_r_download Use PNADcIBGE through R to download PNAD Contínua and write Parquet or RDS.
ibge_microdata_datazoom_pnadc_load Use datazoom.social through R to load PNAD Contínua and save produced files.

Prerequisites

This MCP is a local tool. Users are expected to have:

  • Node.js 18.20 or newer.
  • pnpm.
  • R with Rscript available on PATH.

R is included as a project prerequisite because PNAD Contínua and Data Zoom workflows are best supported by the existing Brazilian R ecosystem. The MCP server itself still runs as a Node/TypeScript process and returns MCP-friendly JSON, Markdown, and local file paths.

Install

  1. Install system runtimes.

On macOS with Homebrew:

brew install node r
npm install -g pnpm@11.7.0

On Windows with winget:

winget install OpenJS.NodeJS
winget install RProject.R
npm install -g pnpm@11.7.0

On Ubuntu/Debian:

sudo apt update
sudo apt install -y nodejs npm r-base
sudo npm install -g pnpm@11.7.0

Linux package managers sometimes ship older Node.js versions. If node --version is below 18.20, install a newer Node.js release from NodeSource, nvm, or the official Node.js installer.

You can also install the same tools from their official installers:

  • Node.js: https://nodejs.org/
  • pnpm: https://pnpm.io/installation
  • R: https://cran.r-project.org/

Check that all three are available:

node --version
pnpm --version
Rscript --version
  1. Clone the GitHub repository:
git clone https://github.com/emmanueltsallis/ibge-microdata-mcp.git
cd ibge-microdata-mcp
  1. Install Node dependencies and build the local MCP server:
pnpm install
pnpm run build
  1. Install the baseline R packages used by the R-backed IBGE workflows:
Rscript -e 'install.packages(c("PNADcIBGE", "survey", "jsonlite", "arrow"), repos = "https://cloud.r-project.org")'
Rscript -e 'install.packages("datazoom.social", repos = c("https://datazoompuc.r-universe.dev", "https://cloud.r-project.org"))'

Run

node dist/index.js

Generic MCP Client Setup

Any MCP client that supports stdio can run this server. After cloning, installing dependencies, and running pnpm run build, configure your client to start the built Node entrypoint.

Use an absolute path to dist/index.js in client configuration. Most clients use the same basic shape:

{
  "mcpServers": {
    "ibge-microdata": {
      "command": "node",
      "args": ["/absolute/path/to/ibge-microdata-mcp/dist/index.js"]
    }
  }
}

If your client uses TOML or another config format, the same idea applies: command is the Node executable, and args contains the absolute path to this server's built dist/index.js file.

Most MCP clients start configured servers when the app or session starts, so restart or reload the client after changing MCP configuration. If the tools do not appear immediately, restart the client first before debugging the server.

MCP uses stdout for protocol messages. This server writes startup and fatal logs to stderr, which keeps stdio protocol traffic clean for clients.

You can smoke-test the server independently of any specific client:

node --input-type=module - <<'NODE'
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const transport = new StdioClientTransport({
  command: 'node',
  args: ['/absolute/path/to/ibge-microdata-mcp/dist/index.js'],
});
const client = new Client({ name: 'ibge-microdata-smoke', version: '0.0.0' });
await client.connect(transport);
const tools = await client.listTools();
console.log(tools.tools.map((tool) => tool.name).sort().join('\n'));
await client.close();
NODE

The smoke test should list tools such as ibge_microdata_list_surveys, ibge_microdata_metadata_inventory, and ibge_microdata_fixed_width_zip_to_parquet.

For a shorter generic walkthrough, see examples/generic-workflow.md. For a starter harmonization recipe, see examples/harmonization-recipe.json. For external harmonization sources that can inform recipes, see docs/harmonization-sources.md.

R-Backed PNADc Workflow

Use the R status tool first:

ibge_microdata_r_status({})

Download PNAD Contínua through PNADcIBGE and write a Parquet file that the MCP can query with DuckDB:

ibge_microdata_pnadc_r_download({
  "year": 2024,
  "quarter": 4,
  "vars": ["UF", "V1028"],
  "outputPath": "/Users/you/.cache/ibge-microdata-mcp/converted/pnadc_2024q4.parquet"
})

Use datazoom.social when you want Data Zoom's PNAD Contínua processing or panel identifiers:

ibge_microdata_datazoom_pnadc_load({
  "outputDir": "/Users/you/.cache/ibge-microdata-mcp/datazoom/pnadc",
  "years": [2024],
  "quarters": [1, 2, 3, 4],
  "panel": "basic",
  "outputFormat": "parquet"
})

Generic Workflow

  1. Find public files from a known survey family or an official directory:
ibge_microdata_list_surveys({})
ibge_microdata_connectivity_check({})
ibge_microdata_list_files({
  "survey": "pof"
})
ibge_microdata_discover({
  "rootUrl": "https://ftp.ibge.gov.br/",
  "maxDepth": 3,
  "maxDirectories": 50
})
ibge_microdata_discover_metadata({
  "rootUrl": "https://ftp.ibge.gov.br/",
  "maxDepth": 4,
  "maxDirectories": 100
})
  1. Inspect file metadata before downloading:
ibge_microdata_file_info({
  "url": "https://ftp.ibge.gov.br/path/to/public/archive.zip"
})
  1. Download to a local cache:
ibge_microdata_download_file({
  "url": "https://ftp.ibge.gov.br/path/to/public/archive.zip",
  "cacheRoot": "/Users/you/.cache/ibge-microdata-mcp"
})

The downloader mirrors the official ftp.ibge.gov.br path under cacheRoot. On repeated calls, it checks IBGE content-length metadata first and returns a cache hit when the existing local file has the expected byte size.

If HTTPS to ftp.ibge.gov.br times out, the downloader may retry the same public file over http://ftp.ibge.gov.br. This does not send credentials or private data; it only downloads public IBGE files. The response reports transport, usedFallback, and sha256 so the transfer remains auditable in headless MCP use.

  1. List the cache later if you need to rediscover local paths:
ibge_microdata_list_cache({
  "cacheRoot": "/Users/you/.cache/ibge-microdata-mcp",
  "limit": 50,
  "offset": 0
})
  1. Preview cache cleanup when storage grows:
ibge_microdata_cleanup_cache({
  "cacheRoot": "/Users/you/.cache/ibge-microdata-mcp",
  "dryRun": true,
  "olderThanDays": 30,
  "minBytes": 100000000
})

The cleanup tool defaults to dryRun: true, requires at least one filter, and only considers files under cacheRoot/ftp.ibge.gov.br. Set dryRun: false only after reviewing the preview.

  1. Inspect archive contents:
ibge_microdata_zip_entries({
  "zipPath": "/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/public/archive.zip"
})
  1. Inventory official metadata and search variables:
ibge_microdata_metadata_inventory({
  "zipPaths": ["/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/public/documentation.zip"],
  "search": "weight",
  "variableLimit": 50
})
ibge_microdata_search_variables({
  "zipPaths": ["/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/public/documentation.zip"],
  "query": "rendimento",
  "limit": 50
})

The metadata tools parse official SAS/TXT input layouts, POF-style Excel dictionaries, generic Excel dictionary tables, and plain-text dictionary tables with recognizable position/width/variable columns. When available, they return value labels/categories such as state codes or response categories. If a file cannot be parsed, the inventory reports which parser attempts were tried.

  1. Export Base dos Dados-style local documentation CSVs when you want portable metadata files:
ibge_microdata_export_architecture_csv({
  "zipPaths": ["/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/public/documentation.zip"],
  "outputPath": "/Users/you/.cache/ibge-microdata-mcp/extra/architecture/variables.csv"
})
ibge_microdata_export_dictionary_csv({
  "zipPaths": ["/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/public/documentation.zip"],
  "outputPath": "/Users/you/.cache/ibge-microdata-mcp/extra/dicionario.csv"
})

The architecture export writes one row per parsed variable. The dictionary export writes one row per parsed category/code label. These are local documentation artifacts inspired by Base dos Dados workflows; they do not upload anything to BigQuery and do not require Base dos Dados credentials.

  1. Inspect a fixed-width layout directly when you already have the layout file:
ibge_microdata_inspect_layout({
  "layoutPath": "/path/to/official-input-layout.txt",
  "search": "weight",
  "limit": 50
})
  1. Convert selected variables to Parquet:
ibge_microdata_fixed_width_zip_to_parquet({
  "layoutPath": "/path/to/official-input-layout.txt",
  "zipPath": "/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/public/archive.zip",
  "entryName": "MICRODATA.txt",
  "outputPath": "/Users/you/.cache/ibge-microdata-mcp/converted/sample.parquet",
  "selectedVariables": ["record_id", "region", "sample_weight", "target_value"]
})
  1. Profile the Parquet file before writing custom SQL:
ibge_microdata_profile_parquet_views({
  "views": [
    {
      "name": "microdata",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/sample.parquet"]
    }
  ],
  "columns": ["region", "sample_weight", "target_value"],
  "topK": 10,
  "sampleRows": 3
})

If columns is omitted, the tool profiles the first 25 columns by default. This keeps wide microdata files manageable while still giving enough information to choose variables and write queries.

  1. Query the Parquet file with DuckDB:
ibge_microdata_query_parquet({
  "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/sample.parquet"],
  "sql": "select region, sum(sample_weight * target_value) / sum(sample_weight) as weighted_mean from microdata group by region order by region",
  "maxRows": 100
})

The query tools accept only SELECT or WITH queries, reject semicolons and write-oriented keywords, and cap returned rows.

Harmonization Recipes

Recipes are optional JSON files that make harmonization assumptions explicit and reusable. The MCP does not ship one universal harmonization standard; instead, a recipe declares the required input views/columns, an output SELECT transformation, optional source references, and validation checks.

ibge_microdata_validate_recipe({
  "recipePath": "/path/to/harmonization-recipe.json",
  "views": [
    {
      "name": "microdata",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/sample.parquet"]
    }
  ],
  "sampleRows": 5
})

If the validation output says requirements and validations passed, write the harmonized Parquet file:

ibge_microdata_apply_recipe({
  "recipePath": "/path/to/harmonization-recipe.json",
  "views": [
    {
      "name": "microdata",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/sample.parquet"]
    }
  ],
  "outputPath": "/Users/you/.cache/ibge-microdata-mcp/converted/harmonized.parquet",
  "sampleRows": 5
})

Recipe SQL accepts only SELECT or WITH statements. The validation tool reports missing input columns, output schema, sample output rows, and validation results without writing a file. The apply tool writes the harmonized output only when requirements and validations pass.

Weighted Distributions

Use ibge_microdata_weighted_distribution when a Parquet file contains one row per analytical unit, a numeric value column, and a numeric survey/sample weight column:

ibge_microdata_weighted_distribution({
  "views": [
    {
      "name": "microdata",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/sample.parquet"]
    }
  ],
  "unitSql": "select region, target_value as value, sample_weight as weight from microdata",
  "valueColumn": "value",
  "weightColumn": "weight",
  "groupColumn": "region",
  "topPercents": [0.01, 0.05, 0.1]
})

The tool ranks units by the value column, applies weights, reports total weight, total value, weighted mean, optional group shares, and top-bracket shares. If a top bracket cuts through tied values at the cutoff, the tied bucket is allocated proportionally.

Relational Records

Some surveys publish multiple record files. Convert each record to Parquet, inspect the resulting schemas, then join named views:

ibge_microdata_describe_parquet_views({
  "views": [
    {
      "name": "record_a",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/record_a.parquet"]
    },
    {
      "name": "record_b",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/record_b.parquet"]
    }
  ],
  "includeRowCounts": true,
  "sampleRows": 3
})
ibge_microdata_query_parquet_views({
  "views": [
    {
      "name": "record_a",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/record_a.parquet"]
    },
    {
      "name": "record_b",
      "parquetPaths": ["/Users/you/.cache/ibge-microdata-mcp/converted/record_b.parquet"]
    }
  ],
  "sql": "select a.region, count(*) as rows from record_a a join record_b b using (record_id) group by a.region order by a.region",
  "maxRows": 100
})

POF Dictionaries

POF editions use Excel dictionary workbooks. Use the manifest tool to map dictionary sheets to TXT entries before converting records:

ibge_microdata_pof_manifest({
  "dictionaryPath": "/path/to/dictionary.xls",
  "dataZipPath": "/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/Dados.zip",
  "search": "weight",
  "variableLimit": 20
})
ibge_microdata_pof_zip_record_to_parquet({
  "dictionaryPath": "/path/to/dictionary.xls",
  "zipPath": "/Users/you/.cache/ibge-microdata-mcp/ftp.ibge.gov.br/path/to/Dados.zip",
  "recordName": "Domicílio",
  "outputPath": "/Users/you/.cache/ibge-microdata-mcp/converted/pof_record.parquet",
  "selectedVariables": ["UF", "ESTRATO_POF", "TIPO_SITUACAO_REG"]
})

The POF converter applies implied decimal scaling from the dictionary and writes DuckDB-queryable Parquet files.

Tests

Offline unit tests:

pnpm test

Live smoke tests against official IBGE endpoints:

RUN_IBGE_SMOKE=1 pnpm test -- tests/smoke.test.ts

Smoke tests list official directories, read HEAD metadata, and download the smaller POF documentation ZIP to verify dictionary parsing. They do not download large microdata data ZIPs.

Local R setup smoke test:

RUN_R_SMOKE=1 pnpm test -- tests/r-smoke.test.ts

The R smoke test checks Rscript and baseline R package availability. It does not download PNAD microdata.

Current Limits

  • This is a local-first MCP server, not a hosted warehouse of all IBGE microdata.
  • Discovery is deliberately bounded; broad root crawls should use explicit maxDepth and maxDirectories values to avoid excessive requests.
  • Generic fixed-width conversion, official metadata inventory/search, Parquet profiling/querying, weighted distribution summaries, and POF dictionary conversion are implemented.
  • Metadata parsing supports common IBGE SAS/TXT input layouts, POF-style Excel dictionaries, generic Excel dictionary tables, and generic TXT dictionary tables. PDF-only/prose-only codebooks and highly irregular historical layouts may still need parser adapters.
  • Additional survey-specific harmonized recipes can be added as optional layers without changing the generic workflow.

License

MIT. See LICENSE.

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured