Cloakbox
Let LLMs analyze sensitive data safely by querying a tokenized, join-preserving copy of the database, with fail-closed PII scanning and provable numeric equivalence.
Let any LLM analyze your sensitive data — without ever showing it a real identity.
Cloakbox builds a sanitized, analysis-ready copy of your database in which people are
replaced by stable, join-preserving tokens. An LLM queries the copy (read-only) and
sees tokens like SUB_2c17917e63b5 instead of names. A separate, isolated, human-only
tool can re-identify when a person genuinely needs to — and every reversal is audited.
Built on DuckDB + the Model Context Protocol. Reuses mature building blocks; see docs/04-prior-art.md.
VAULT (real data) ──build──► CLOAKBOX (tokens, no PII) ──read-only MCP──► LLM
   read-only          │
                      └──► MAPPING (isolated) ◄── manual, human-only decoder
Why it's different
Most "PII firewalls" redact text in flight as the AI hits real data — a detection
miss is a live leak. Cloakbox inverts that: it pre-sanitizes the whole warehouse
with an explicit, fail-closed policy, then lets the AI roam the clean copy freely.
And it proves no analytical value was lost: equivalence_check.py shows reports
return identical numbers on the vault and the box.
| | Runtime PII proxy | Cloakbox |
|---|---|---|
| Basis | Detection (miss = leak) | Explicit per-column policy + fail-closed scan |
| Cross-table joins | Best-effort | Preserved by deterministic tokens |
| Correctness | — | Equivalence proof in CI |
| Re-identification | Often inline/automatic | Isolated, manual, human-only, audited |
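"Fail-closed" in the table above means a residual-PII hit stops the build instead of being logged and ignored. A minimal sketch of that behaviour, not the project's actual scanner: the regex, the `ResidualPIIError` name, and the in-memory table layout are all assumptions made for illustration.

```python
import re

# Deliberately loose pattern: for a fail-closed scan, a false positive
# (a blocked build) is cheaper than a false negative (a leak).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ResidualPIIError(Exception):
    """Hypothetical error type: raised when the sanitized copy still holds PII."""


def scan_for_emails(tables):
    """Return (table, column, row_index) for every email-shaped string value."""
    hits = []
    for table, rows in tables.items():
        for index, row in enumerate(rows):
            for column, value in row.items():
                if isinstance(value, str) and EMAIL_RE.search(value):
                    hits.append((table, column, index))
    return hits


def validate(tables):
    """Fail closed: any hit raises, so a caller cannot publish the box by accident."""
    hits = scan_for_emails(tables)
    if hits:
        raise ResidualPIIError(f"{len(hits)} email-shaped value(s), first at {hits[0]}")


clean = {"students": [{"subject": "SUB_2c17917e63b5", "campus": "North"}]}
leaky = {"students": [{"subject": "SUB_2c17917e63b5", "note": "contact jo@example.org"}]}

validate(clean)  # passes silently
try:
    validate(leaky)
except ResidualPIIError as exc:
    print("build blocked:", exc)
```

The scan covers every string column, not only the ones the policy lists, so a free-text field that was missed in the policy still trips it.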
Quickstart (synthetic data, ~1 minute)
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cd pipeline
python3 make_example_vault.py # generate a fake vault
python3 build_cloakbox.py init # create the secret salt (once)
python3 build_cloakbox.py build # vault -> sanitized cloakbox + isolated mapping
python3 build_cloakbox.py validate # fail-closed PII + k-anonymity scan
python3 equivalence_check.py # prove the box == the vault, numerically
Full walkthrough: docs/quickstart.md.
See it run
The build is fail-closed and the result is provably equivalent to the source:
$ python3 build_cloakbox.py validate
== Residual PII scan (emails) ==
OK — no email-shaped values found.
== Token format check ==
(checked all policy-tokenized columns)
== k-anonymity report (k=5) ==
ok students: 0 QI-groups below k on (grade_level, campus)
WARN enrollments: 403 QI-groups below k on (course_id, grade, campus)
VALIDATION PASSED
$ python3 equivalence_check.py
== 4a) Aggregate report equivalence (numbers must match exactly) ==
IDENTICAL passed assessments per course
IDENTICAL distinct subjects per campus
IDENTICAL enrollments joined to courses, count per subject area
== 4b) Identity-labelled report (vault relabelled via mapping == box) ==
IDENTICAL distinct subjects per teacher
EQUIVALENCE PASSED — Cloakbox reproduces the vault's output exactly
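The two checks in that output can be sketched as follows. This is an illustration of the idea, not the contents of equivalence_check.py: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra packages, and the table, column names, placeholder tokens, and queries are made up for the example.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory database holding one hypothetical enrollments table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE enrollments (subject TEXT, campus TEXT, course_id TEXT)")
    con.executemany("INSERT INTO enrollments VALUES (?, ?, ?)", rows)
    return con


def run(con, sql):
    return con.execute(sql).fetchall()


# Placeholder mapping from real identity to token (tokens are not real hashes).
mapping = {
    "alice@example.org": "SUB_aaaaaaaaaaaa",
    "bob@example.org": "SUB_bbbbbbbbbbbb",
}
vault_rows = [
    ("alice@example.org", "North", "MATH-101"),
    ("alice@example.org", "North", "ART-200"),
    ("bob@example.org", "South", "MATH-101"),
]
box_rows = [(mapping[s], campus, course) for s, campus, course in vault_rows]
vault, box = make_db(vault_rows), make_db(box_rows)

# 4a) Aggregate report: identities never appear in the output,
# so the numbers must match exactly.
AGGREGATE = (
    "SELECT campus, COUNT(DISTINCT subject) FROM enrollments "
    "GROUP BY campus ORDER BY campus"
)
aggregate_ok = run(vault, AGGREGATE) == run(box, AGGREGATE)

# 4b) Identity-labelled report: relabel the vault's output through the
# mapping, then compare it with the box's output.
PER_SUBJECT = "SELECT subject, COUNT(*) FROM enrollments GROUP BY subject"
relabelled = sorted((mapping[s], n) for s, n in run(vault, PER_SUBJECT))
identity_ok = relabelled == sorted(run(box, PER_SUBJECT))

print("IDENTICAL" if aggregate_ok else "DIFFERENT", "distinct subjects per campus")
print("IDENTICAL" if identity_ok else "DIFFERENT", "enrollments per subject")
```

Both comparisons are exact equality on the result rows, which is why the check only works when tokens are deterministic: one person must map to one token everywhere.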
How it works
- Tokenize, deterministically. PREFIX_ + sha256(salt || domain || value). Same input → same token, so joins and distinct-counts survive; one-way, so the box can't be reversed. (policy)
- Fail closed. validate scans for residual emails and malformed tokens and reports k-anonymity violations; a leak blocks the build.
- Gate read-only. A DuckDB MCP server points only at cloakbox.duckdb. (gateway template)
- Re-identify out-of-band. The decoder is a manual, isolated, audited CLI, never reachable by the model.
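The tokenization step can be sketched with `hashlib` alone. This is a sketch under assumptions, not the build engine's code: the `make_token` name, the NUL separator between fields, and truncating the digest to 12 hex digits (chosen to match the shape of SUB_2c17917e63b5) are all illustrative choices.

```python
import hashlib


def make_token(prefix, salt, domain, value, hex_len=12):
    """PREFIX_ + truncated sha256(salt || domain || value).

    The NUL separators are an assumption: they keep ("ab", "c") and
    ("a", "bc") from concatenating to the same byte string.
    """
    material = salt + b"\x00" + domain.encode("utf-8") + b"\x00" + value.encode("utf-8")
    digest = hashlib.sha256(material).hexdigest()
    return f"{prefix}_{digest[:hex_len]}"


salt = b"demo-salt-not-a-secret"  # the real salt stays outside the repo

students = [{"email": "ada@example.org", "campus": "North"}]
enrollments = [
    {"email": "ada@example.org", "course_id": "MATH-101"},
    {"email": "ada@example.org", "course_id": "ART-200"},
]

# Tokenize the same identity column independently in each table.
tok_students = [make_token("SUB", salt, "subject", r["email"]) for r in students]
tok_enroll = [make_token("SUB", salt, "subject", r["email"]) for r in enrollments]

# The join key survives: both enrollment rows still match the student row.
known = set(tok_students)
joined = [t for t in tok_enroll if t in known]
print(len(joined))  # 2
```

Including the domain in the hash means the same raw value tokenized under two different domains yields unrelated tokens, so tokens only join where the policy intends them to.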
Layout
pipeline/ build engine, policy config, equivalence check, synthetic-data generator
decode/ isolated, human-only re-identification tool
gateway/ read-only MCP config + agent guardrail rule (templates)
docs/ architecture, anonymization policy, security, prior art, decoder, quickstart
Point it at your own data
Edit pipeline/cloakbox_config.py: set the vault path
and adjust the column rules and report definitions to your schema. Re-run build +
equivalence. Never commit real data or the salt — see .gitignore.
Security
Read docs/03-security.md. Key point: a read-only DuckDB
connection blocks writes but does not sandbox the filesystem — the real
isolation is OS file permissions keeping the vault, salt, and mapping out of the
gateway's reach (pipeline/secure_paths.sh). This is a pattern, not a compliance
certification; have counsel review regulated deployments.
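One way to spot-check the file-permission side is to confirm that the sensitive files carry no group or other permission bits. The sketch below is a hypothetical Python check, not what pipeline/secure_paths.sh does; it assumes a POSIX system where the gateway runs as a different OS user than the owner of the files, and the file name is invented.

```python
import os
import stat
import tempfile


def exposed_to_others(path):
    """True if group or other users hold any permission bit on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


with tempfile.TemporaryDirectory() as tmp:
    salt_path = os.path.join(tmp, "salt.key")  # hypothetical file name
    with open(salt_path, "w") as fh:
        fh.write("not-a-real-salt")

    os.chmod(salt_path, 0o644)  # world-readable: the gateway user could read it
    before = exposed_to_others(salt_path)

    os.chmod(salt_path, 0o600)  # owner-only
    after = exposed_to_others(salt_path)

print("exposed before:", before, "| exposed after:", after)
```

A check like this only covers mode bits on the file itself; directory permissions, ACLs, and the user the gateway actually runs as still need to be reviewed separately.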
License
MIT — see LICENSE.