Cloakbox

Let LLMs analyze sensitive data safely by querying a tokenized, join-preserving copy of the database, with fail-closed PII scanning and provable numeric equivalence.

Let any LLM analyze your sensitive data — without ever showing it a real identity.

Cloakbox builds a sanitized, analysis-ready copy of your database in which people are replaced by stable, join-preserving tokens. An LLM queries the copy (read-only) and sees tokens like SUB_2c17917e63b5 instead of names. A separate, isolated, human-only tool can re-identify a token when a person genuinely needs to, and every reversal is audited.

Built on DuckDB + the Model Context Protocol. Reuses mature building blocks; see docs/04-prior-art.md.

 VAULT (real data)  ──build──►  CLOAKBOX (tokens, no PII)  ──read-only MCP──►  LLM
   read-only                          │
                                      └──►  MAPPING (isolated)  ◄── manual, human-only decoder

Why it's different

Most "PII firewalls" redact text in flight while the AI queries real data, so a single detection miss is a live leak. Cloakbox inverts that: it pre-sanitizes the whole warehouse under an explicit, fail-closed policy, then lets the AI roam the clean copy freely. It also checks that no analytical value was lost: equivalence_check.py runs the configured reports against both the vault and the box and confirms they return identical numbers.

                     Runtime PII proxy         Cloakbox
Basis                Detection (miss = leak)   Explicit per-column policy + fail-closed scan
Cross-table joins    Best-effort               Preserved by deterministic tokens
Correctness                                    Equivalence proof in CI
Re-identification    Often inline/automatic    Isolated, manual, human-only, audited
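
The equivalence idea can be illustrated with a small sketch. It is not the project's equivalence_check.py: it uses the standard-library sqlite3 module as a stand-in for DuckDB, and the table, the salt, and the tokenize helper are hypothetical. It runs one aggregate report against a "vault" table and a tokenized "box" table and compares the results.

```python
import hashlib
import sqlite3

SALT = b"demo-salt-not-a-real-secret"  # assumption: illustrative value only


def tokenize(value: str) -> str:
    # Hypothetical token function: deterministic, so equal inputs stay equal.
    digest = hashlib.sha256(SALT + b"|student|" + value.encode("utf-8")).hexdigest()
    return "SUB_" + digest[:12]


def make_db(rows):
    # sqlite3 stands in for DuckDB here; the SQL is the same for both copies.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE enrollments (student TEXT, campus TEXT)")
    con.executemany("INSERT INTO enrollments VALUES (?, ?)", rows)
    return con


vault_rows = [
    ("ana@example.org", "north"),
    ("ana@example.org", "north"),
    ("ben@example.org", "north"),
    ("cy@example.org", "south"),
]
box_rows = [(tokenize(student), campus) for student, campus in vault_rows]

REPORT = """
    SELECT campus, COUNT(DISTINCT student) AS subjects
    FROM enrollments
    GROUP BY campus
    ORDER BY campus
"""

vault_result = make_db(vault_rows).execute(REPORT).fetchall()
box_result = make_db(box_rows).execute(REPORT).fetchall()
print("IDENTICAL" if vault_result == box_result else "MISMATCH", box_result)
```

Because the report only counts and groups, it never needs the real identity, which is why a deterministic token can replace it without changing the numbers.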

Quickstart (synthetic data, ~1 minute)

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cd pipeline
python3 make_example_vault.py     # generate a fake vault
python3 build_cloakbox.py init     # create the secret salt (once)
python3 build_cloakbox.py build    # vault -> sanitized cloakbox + isolated mapping
python3 build_cloakbox.py validate # fail-closed PII + k-anonymity scan
python3 equivalence_check.py      # prove the box == the vault, numerically

Full walkthrough: docs/quickstart.md.

See it run

The build is fail-closed and the result is provably equivalent to the source:

$ python3 build_cloakbox.py validate
== Residual PII scan (emails) ==
  OK — no email-shaped values found.
== Token format check ==
  (checked all policy-tokenized columns)
== k-anonymity report (k=5) ==
  ok   students: 0 QI-groups below k on (grade_level, campus)
  WARN enrollments: 403 QI-groups below k on (course_id, grade, campus)
VALIDATION PASSED

$ python3 equivalence_check.py
== 4a) Aggregate report equivalence (numbers must match exactly) ==
  IDENTICAL          passed assessments per course
  IDENTICAL          distinct subjects per campus
  IDENTICAL          enrollments joined to courses, count per subject area
== 4b) Identity-labelled report (vault relabelled via mapping == box) ==
  IDENTICAL          distinct subjects per teacher
EQUIVALENCE PASSED — Cloakbox reproduces the vault's output exactly
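
The validate output above can be read against a minimal sketch of the two checks it reports. This is an illustration under stated assumptions, not the code in build_cloakbox.py: the email pattern, the function names, and the row format (a list of dicts) are all hypothetical. As in the output shown, an email-shaped value aborts the run, while groups below k are only reported.

```python
import re
from collections import Counter

# Assumption: a deliberately simple email-shaped pattern for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(rows):
    """Return every string cell that contains an email-shaped value."""
    return [
        cell
        for row in rows
        for cell in row.values()
        if isinstance(cell, str) and EMAIL_RE.search(cell)
    ]


def groups_below_k(rows, quasi_identifiers, k=5):
    """Count quasi-identifier combinations shared by fewer than k rows."""
    sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return sum(1 for size in sizes.values() if size < k)


def validate(rows, quasi_identifiers, k=5):
    leaks = find_emails(rows)
    if leaks:
        # Fail closed: residual PII stops the run instead of producing a warning.
        raise SystemExit(f"VALIDATION FAILED: {len(leaks)} email-shaped value(s)")
    small = groups_below_k(rows, quasi_identifiers, k)
    status = "ok  " if small == 0 else "WARN"
    print(f"  {status} {small} QI-groups below k on {quasi_identifiers}")
    print("VALIDATION PASSED")
    return small


# Five rows share one (grade_level, campus) group; one row sits alone.
rows = [{"subject": f"SUB_{i:012x}", "grade_level": 9, "campus": "north"} for i in range(5)]
rows.append({"subject": "SUB_00000000ffff", "grade_level": 10, "campus": "south"})
validate(rows, ("grade_level", "campus"))
```

A small quasi-identifier group matters because a token alone hides the name, but a unique combination of attributes can still single out one person.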

How it works

  1. Tokenize, deterministically. PREFIX_ + sha256(salt || domain || value). Same input → same token, so joins and distinct-counts survive; the hash is one-way and the salt is kept out of the box, so tokens can't be reversed from the box alone. (policy)
  2. Fail closed. validate scans for residual emails and malformed tokens, and reports k-anonymity violations as warnings; a detected leak blocks the build.
  3. Gate read-only. A DuckDB MCP server points only at cloakbox.duckdb. (gateway template)
  4. Re-identify out-of-band. The decoder is a manual, isolated, audited CLI — never reachable by the model.
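
Step 1 can be sketched as follows. This is a minimal illustration, not the project's implementation: the make_token name, the separator byte between fields, and the truncation to 12 hex characters (inferred from the example token SUB_2c17917e63b5) are assumptions.

```python
import hashlib


def make_token(prefix: str, salt: bytes, domain: str, value: str, hex_chars: int = 12) -> str:
    """Derive PREFIX_ + sha256(salt || domain || value), truncated for readability."""
    # Assumption: a separator byte keeps ("ab", "c") and ("a", "bc") from colliding.
    material = salt + b"\x1f" + domain.encode("utf-8") + b"\x1f" + value.encode("utf-8")
    return f"{prefix}_{hashlib.sha256(material).hexdigest()[:hex_chars]}"


salt = b"demo-only-salt"  # assumption: the real salt is a generated secret

# The same person appearing in two tables gets the same token, so joins still work.
in_students = make_token("SUB", salt, "person", "ana@example.org")
in_enrollments = make_token("SUB", salt, "person", "ana@example.org")
someone_else = make_token("SUB", salt, "person", "ben@example.org")
other_domain = make_token("SUB", salt, "course", "ana@example.org")

print(in_students == in_enrollments)  # same input, same token
print(in_students != someone_else)    # different people stay distinct
print(in_students != other_domain)    # the domain separates namespaces
```

Including the domain in the hash means the same raw value used in two unrelated columns does not produce linkable tokens, and keeping the salt secret is what stops someone from hashing guessed values and matching them against the box.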

Layout

pipeline/   build engine, policy config, equivalence check, synthetic-data generator
decode/     isolated, human-only re-identification tool
gateway/    read-only MCP config + agent guardrail rule (templates)
docs/       architecture, anonymization policy, security, prior art, decoder, quickstart

Point it at your own data

Edit pipeline/cloakbox_config.py: set the vault path and adjust the column rules and report definitions to your schema. Re-run build + equivalence. Never commit real data or the salt — see .gitignore.

Security

Read docs/03-security.md. Key point: a read-only DuckDB connection blocks writes but does not sandbox the filesystem — the real isolation is OS file permissions keeping the vault, salt, and mapping out of the gateway's reach (pipeline/secure_paths.sh). This is a pattern, not a compliance certification; have counsel review regulated deployments.
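
The file-permission point can be illustrated with a short check. It does not reproduce pipeline/secure_paths.sh; the function names and the file name are hypothetical, and it assumes a POSIX system where the gateway runs as a different OS user from the pipeline. It flags any sensitive file that grants permissions to group or other users.

```python
import os
import stat
import tempfile


def exposed_to_others(path: str) -> bool:
    """True if group or other users have any permission bits on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def check_secrets(paths):
    """Return the paths that are accessible beyond their owner."""
    return [p for p in paths if exposed_to_others(p)]


# Demonstration on a throwaway file standing in for the salt.
with tempfile.TemporaryDirectory() as workdir:
    salt_path = os.path.join(workdir, "salt.key")  # hypothetical file name
    with open(salt_path, "w") as handle:
        handle.write("not-a-real-salt")

    os.chmod(salt_path, 0o600)  # owner read/write only
    print("owner-only:", check_secrets([salt_path]))

    os.chmod(salt_path, 0o644)  # world-readable: should be flagged
    print("world-readable:", check_secrets([salt_path]))
```

A check like this only inspects mode bits. It helps only if the vault, salt, and mapping are owned by an account the gateway process does not run as.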

License

MIT — see LICENSE.
