Veritas

An AI-powered data investigation workspace for Product Managers, combining structured investigation workflows with a collectively maintained three-tier data catalog.

What This Is

Three things working together:

Investigation MCP Server -- a custom MCP server that provides investigation lifecycle tools (start, continue, capture, close), three-tier catalog management (local/community/curated), and a self-improving memory system
Data Investigation Framework -- a structured workflow for running multi-session analyses with persistent context, findings tracking, and institutional knowledge that compounds over time
Agent Memory System -- behavioral learning that enables the AI agent to remember corrections, preferences, and workflow patterns across sessions, with a promotion pipeline for turning recurring patterns into permanent rule improvements

Data access (SQL queries, schema discovery, profiling) is handled by the official Databricks MCP server, which runs alongside the investigation MCP server. This separation means adding new data sources (Dynatrace, Amplitude, etc.) is just adding another MCP server entry.

Quick Start

Setup takes ~5 minutes. See GETTING_STARTED.md for the full guide with screenshots, or follow the quick version:

1. Clone and run setup

git clone https://github.com/hertzcorp/htzd-databricks-mcp.git
cd htzd-databricks-mcp

Mac / Linux: ./setup.sh | Windows (PowerShell): .\setup.ps1

The script checks prerequisites, installs dependencies, then asks you a few questions: your name, email, Databricks HTTP path, and a personal access token. It shows you the exact links where you can find each value. One run, no file editing.

2. Restart Cursor and approve MCP servers

Restart Cursor (Mac: Cmd+Q | Windows: Alt+F4, then reopen). When it starts, click "Start" on the two MCP server prompts at the bottom of the screen. This is a one-time approval -- all tools are pre-approved after that.

3. Start asking questions

Open the Chat panel (Mac: Cmd+Shift+I | Windows: Ctrl+Shift+I) and try:

"What tables do we know about?" -- browse the team's data catalog
"Start investigation" -- begin a tracked multi-session analysis
"What data do we have on billing?" -- search the catalog for relevant knowledge
"Deep investigate: why is contact rate rising?" -- autonomous recursive investigation
"Deep index" -- crawl Databricks lineage to build the table knowledge base

See docs/ONE-PAGER.md for a quick reference of all features with end-to-end examples.

Project Structure

.
├── src/investigation_mcp/          # Investigation MCP server
│   ├── server.py                   # MCP server entry point (24 tools)
│   └── tools/                      # Tool implementations
│       ├── investigation.py        # Start, continue, capture, close investigations
│       ├── catalog_index.py        # Catalog reading: index, entries, health, context
│       ├── catalog_local.py        # Local entry capture + preferences
│       ├── catalog_sharing.py      # Share local entries to community via PR
│       ├── catalog_curation.py     # Curator tools: review, approve, reject
│       ├── memory.py               # Agent memory: capture, load, reinforce, review, apply
│       ├── _git_helpers.py         # Shared git utilities
│       └── catalog_management.py   # Legacy catalog tools (deprecated)
├── data-catalog/                       # Three-tier data catalog
│   ├── curated/                        # Reviewed, authoritative knowledge (team-shared)
│   ├── community/                      # Unvetted contributions from team members (team-shared)
│   ├── local/                          # Personal notes, auto-captured (gitignored)
│   ├── _deep_index/                    # Pre-computed table knowledge base (team-shared)
│   ├── _schema.yaml                    # Entry type definitions
│   └── _curators.yaml                  # Approved catalog curators
├── investigations/                     # Tracked analyses (gitignored except templates + index)
│   └── _templates/                     # Templates for new investigations
├── docs/
│   ├── hertzcorp-architecture/         # Hertz system architecture references
│   ├── adoption/                       # Adoption strategy and growth plans
│   ├── workshop/                       # Workshop facilitator guide and slide assets
│   ├── ONE-PAGER.md                    # Quick feature reference
│   ├── SPEC.md                         # Technical specification
│   └── PATTERNS.md                     # Architectural patterns
├── .cursor/
│   ├── memory/                         # Agent behavioral memory (gitignored)
│   │   ├── corrections/                # Behavioral corrections (YAML)
│   │   ├── preferences/                # Style and output preferences (YAML)
│   │   ├── patterns/                   # Detected workflow patterns (YAML)
│   │   └── reflections/                # Session retros (Markdown)
│   ├── rules/                          # Cursor rules
│   │   ├── analysis-principles.mdc     # 8 data analysis principles
│   │   ├── memory.mdc                  # Agent memory behavior (when to capture/load/cite)
│   │   ├── pm-guardrails.mdc           # Prevents AI from modifying protected files
│   │   ├── first-run.mdc               # Welcome experience for new users
│   │   └── ...                         # Development workflow rules
│   ├── hooks/
│   │   └── memory-flush.py             # Advisory: remind to capture learnings at session end
│   └── skills/
│       └── data-investigation/         # Investigation workflow skill
├── ideation/                       # Product concepts and drafts (gitignored)
├── scripts/                        # Standalone analysis scripts
├── .github/
│   └── workflows/
│       └── catalog-validation.yml  # CI: validates catalog entries on PR
├── setup.sh                        # Onboarding wizard
├── GETTING_STARTED.md              # PM-friendly setup guide
└── pyproject.toml                  # Package: investigation-toolkit

Three-Tier Data Catalog

The catalog is the institutional knowledge layer. It grows with every investigation.

Tier	Location	Shared?	Trust Level	How It Gets There
Local	`data-catalog/local/`	No (gitignored)	Personal notes	AI auto-captures during investigations
Community	`data-catalog/community/`	Yes (via PR)	Unvetted	PM shares via `share_to_catalog` tool
Curated	`data-catalog/curated/`	Yes (via PR)	Authoritative	Curator promotes from community

Knowledge lifecycle:

Investigation discovers knowledge
  -> AI auto-captures to local/ (continuous, silent)
  -> PM selects what to share ("share this with the team")
  -> share_to_catalog creates PR to community/ (CI validates, approver merges)
  -> Curator reviews community entries (review_pending)
  -> Curator promotes to curated/ (approve_entry) or rejects (reject_entry)

The AI reads all three tiers with trust ordering: curated (high confidence) > community (hints, verified before use) > local (personal recall).
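
The trust ordering above can be sketched in a few lines of Python. This is only an illustration: the tier names come from this README, but the `Entry` shape, the numeric ranks, and both helper functions are hypothetical and are not taken from the investigation MCP source.

```python
from dataclasses import dataclass

# Assumed ranks: a lower number means more trusted. Only the tier names
# come from the README; the numbers are illustrative.
TIER_RANK = {"curated": 0, "community": 1, "local": 2}


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified catalog entry."""
    topic: str
    tier: str
    note: str


def order_by_trust(entries):
    """Return entries sorted so the most trusted tier comes first."""
    return sorted(entries, key=lambda e: TIER_RANK[e.tier])


def best_entry_per_topic(entries):
    """Keep only the most trusted entry for each topic."""
    best = {}
    for entry in order_by_trust(entries):
        # The first entry seen for a topic is the most trusted one.
        best.setdefault(entry.topic, entry)
    return best


entries = [
    Entry("billing", "local", "personal scratch note"),
    Entry("billing", "curated", "reviewed definition"),
    Entry("contact_rate", "community", "unvetted hint"),
]
resolved = best_entry_per_topic(entries)
print(resolved["billing"].tier)       # curated
print(resolved["contact_rate"].tier)  # community
```

A community or local entry still surfaces when nothing more trusted exists for a topic, which matches the idea that lower tiers act as hints rather than being ignored.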

Investigation Framework

How It Works

Say "start investigation" in Cursor chat to begin a tracked analysis. The framework:

Discovery -- loads the catalog, refines the question, finds the right tables
Analysis -- runs queries via Databricks MCP, auto-captures discoveries to local catalog
Validation -- samples raw records, confirms field semantics, checks system behavior
Capture -- saves findings, presents unshared catalog entries for sharing
Close -- marks complete, prompts for catalog sharing

Investigations persist across sessions. Say "continue investigation" to resume where you left off.
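
As a rough mental model of how an investigation might survive between sessions, the sketch below stores the current phase and findings in a JSON file and reloads them later. The phase names follow the list above; the `state.json` file name, the JSON layout, and every function name are assumptions made for this example, not the server's actual storage format.

```python
import json
import tempfile
from pathlib import Path

# Phase names follow the README; everything else here is hypothetical.
PHASES = ["discovery", "analysis", "validation", "capture", "close"]


def save(folder: Path, state: dict) -> None:
    """Write the investigation state to disk."""
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "state.json").write_text(json.dumps(state, indent=2))


def start(folder: Path, question: str) -> dict:
    """Create a new investigation in the first phase."""
    state = {"question": question, "phase": PHASES[0], "findings": []}
    save(folder, state)
    return state


def resume(folder: Path) -> dict:
    """Load the state written by an earlier session."""
    return json.loads((folder / "state.json").read_text())


def advance(state: dict) -> dict:
    """Move to the next phase; stay on the last phase once it is reached."""
    i = PHASES.index(state["phase"])
    state["phase"] = PHASES[min(i + 1, len(PHASES) - 1)]
    return state


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "contact-rate"
    state = start(folder, "Why is contact rate rising?")
    state["findings"].append("placeholder finding from session one")
    save(folder, advance(state))

    # A later session picks up where the first one stopped.
    resumed = resume(folder)
    print(resumed["phase"])  # analysis
```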

Agent Memory System

The agent learns from your corrections and preferences over time through two paths:

Soft path (immediate): Corrections and preferences are saved to .cursor/memory/ and loaded into context at the start of future sessions. The agent adjusts its behavior based on past learnings without any structural changes.

Hard path (permanent): When a correction recurs 3+ times across sessions, the agent detects the pattern and proposes a permanent change to the workflow rules or investigation skill. You approve before anything changes. Platform-scope proposals create a PR for team review.
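
The hard path can be pictured as a counter with an approval gate. The sketch below is a simplified illustration: the threshold of 3 comes from this README, while the `CorrectionTracker` class, its method names, and the choice to count every occurrence (rather than distinct sessions) are assumptions.

```python
from collections import Counter

PROMOTION_THRESHOLD = 3  # "3+ times", per the README


class CorrectionTracker:
    """Hypothetical tracker for recurring corrections."""

    def __init__(self):
        self.counts = Counter()
        self.applied = set()

    def record(self, correction: str) -> None:
        """Note one more occurrence of a correction."""
        self.counts[correction] += 1

    def pending_proposals(self):
        """Corrections seen often enough to propose as a rule change."""
        return sorted(
            c for c, n in self.counts.items()
            if n >= PROMOTION_THRESHOLD and c not in self.applied
        )

    def apply(self, correction: str, approved: bool) -> bool:
        """Apply a proposal only when it is pending and a human approved it."""
        if not approved or correction not in self.pending_proposals():
            return False
        self.applied.add(correction)
        return True


tracker = CorrectionTracker()
for _ in range(3):
    tracker.record("use fiscal weeks")
tracker.record("prefer bar charts")

print(tracker.pending_proposals())                        # ['use fiscal weeks']
print(tracker.apply("use fiscal weeks", approved=False))  # False
print(tracker.apply("use fiscal weeks", approved=True))   # True
```

Nothing is promoted without the `approved` flag, mirroring the rule that you approve before anything changes.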

What	How
Correct the agent	It calls `capture_memory` automatically -- remembered next session
Same correction 3x	Pattern detected, proposed as a rule change at session end
Approve a proposal	`apply_proposal` edits the rule file (personal) or creates a PR (platform)
Wrong promotion	Agent detects contradiction and proposes rollback
Check system health	`memory_stats` shows counts, averages, maintenance status

Memory is personal and gitignored. Each PM's agent learns independently. Improvements that benefit everyone flow through the PR-based promotion pipeline.

Analysis Principles

The .cursor/rules/analysis-principles.mdc rule enforces eight principles:

Ground truth first -- sample raw records before aggregating
Profile before joining -- verify field contents, don't assume from names
Ask about system behavior early -- domain knowledge shapes data interpretation
Correlation is not causation -- prevalence on a record does not equal cause
Explicit assumptions -- every volume estimate needs a stated deflection rate
Hypotheses before conclusions -- present as "data suggests" until validated
Flag observability gaps -- generic error codes need decomposition, not sizing
Signal catalog trust level -- indicate whether knowledge is curated, community, or local
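
The "explicit assumptions" principle can be made concrete with a small worked example. In the sketch below the deflection rate is a required keyword argument, so an estimate cannot be produced without stating it. The function, the dataclass, and the numbers are all hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeEstimate:
    addressable_contacts: int
    deflection_rate: float  # the stated assumption, between 0.0 and 1.0
    deflected_contacts: int

    def summary(self) -> str:
        """Phrase the estimate as a hypothesis with its assumption attached."""
        return (
            f"Data suggests ~{self.deflected_contacts} contacts are avoidable, "
            f"assuming a {self.deflection_rate:.0%} deflection rate "
            f"on {self.addressable_contacts} addressable contacts"
        )


def estimate_volume(addressable_contacts: int, *, deflection_rate: float) -> VolumeEstimate:
    """Compute deflected volume; the rate has no default on purpose."""
    if not 0.0 <= deflection_rate <= 1.0:
        raise ValueError("deflection_rate must be between 0 and 1")
    deflected = round(addressable_contacts * deflection_rate)
    return VolumeEstimate(addressable_contacts, deflection_rate, deflected)


# Made-up inputs: 12,000 addressable contacts and an assumed 25% deflection rate.
estimate = estimate_volume(12_000, deflection_rate=0.25)
print(estimate.summary())
```

Calling `estimate_volume(12_000)` without the rate raises a `TypeError`, which is the point: the assumption has to be written down before a number comes out.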

Commands Reference

Investigation Commands

Command	What It Does
`start investigation`	Create a new tracked investigation
`continue [name]`	Resume a previous investigation with full context
`capture findings`	Save findings and offer to share catalog entries
`close investigation`	Mark complete, prompt for final catalog sharing
`list investigations`	Show active and recent investigations

Catalog Commands

Command	What It Does
`What tables do we know about?`	Load catalog index -- shows all entries across tiers
`What do we know about [topic]?`	Load entries for a specific domain or subject
`share this with the team`	Share a local catalog entry to community via PR
`catalog health`	Show catalog metrics: backlog, coverage, stale entries
`retry pending shares`	Retry shares that failed due to connectivity

Curator Commands

Command	What It Does
`review catalog`	Show pending community entries grouped by domain
`approve [entry]`	Promote a community entry to curated (curator-only)
`reject [entry]`	Remove a community entry with optional reason

MCP Tools Reference

Investigation MCP (Custom)

Tool	Description
`start_investigation`	Create investigation folder with templates
`continue_investigation`	Load investigation context and resume
`capture_findings`	Save findings, prompt for catalog sharing
`close_investigation`	Mark complete, final catalog check
`list_investigations`	List active/completed investigations
`load_catalog_index`	Lightweight index of all catalog entries
`get_catalog_entries`	Full content for entries matching filter
`get_data_context`	Read curated domain knowledge
`catalog_health`	Catalog metrics and health summary
`add_local_entry`	Capture a discovery to local catalog
`share_to_catalog`	Share local entry to community via PR
`retry_pending_shares`	Retry failed shares from offline sessions
`review_pending`	Summarize unreviewed community entries
`approve_entry`	Promote community entry to curated (curator-only)
`reject_entry`	Remove community entry (curator-only)
`get_preferences`	Read learned preferences
`save_preference`	Save a preference for future sessions
`capture_memory`	Capture a correction, preference, or pattern to memory
`load_relevant_memories`	Load scored, filtered memories for current context
`reinforce_memory`	Reinforce an existing memory when same correction recurs
`learning_review`	Generate end-of-session learning summary with proposals
`capture_reflection`	Write a structured session retro to memory
`apply_proposal`	Apply approved proposal to rule/skill file (or create PR)
`memory_stats`	Report memory system health and run maintenance

Official Databricks MCP (Data Access)

Data access tools are provided by the official Databricks MCP server:

Tool	Description
`list_catalogs`	List Unity Catalog catalogs
`list_schemas`	List schemas in a catalog
`list_tables`	List tables in a schema
`describe_table`	Get column details
`execute_sql`	Run SQL queries
`sample_data`	Get sample rows
`profile_table`	Generate data quality profile

Tips for Better Results

At the start of an investigation:

The AI loads the catalog automatically. For new domains without curated entries, tell the AI what tables to use.
Share system context early -- how key systems behave, known data quirks, business rules.
State a hypothesis: "I think X is caused by Y" is more productive than "Tell me about X."

During analysis:

Challenge aggregate findings. Ask "Have you looked at actual records?"
When the AI presents a volume estimate, ask about the assumption: "What deflection rate?"
The AI auto-captures gotchas, joins, and terms to your local catalog as it discovers them.

At the end of a session:

Say "capture findings" to save progress. The AI presents unshared catalog entries for sharing.
Choose which discoveries to share with the team. The AI handles the PR.

For executive deliverables:

Ask the AI to create a separate brief rather than editing findings.md.
Push back on oversimplified narratives. Ask for sub-problems.
Request an assumptions and limitations section.

Extending with Additional MCP Servers

Adding a new data source is just adding an entry to .cursor/mcp.json.template:

{
  "mcpServers": {
    "dynatrace": {
      "command": "npx",
      "args": ["-y", "@dynatrace-oss/dynatrace-mcp-server@latest"],
      "env": {
        "DT_URL": "{{DT_URL}}",
        "DT_TOKEN": "{{DT_TOKEN}}"
      }
    }
  }
}

The investigation MCP and catalog are source-agnostic. The catalog's system field tracks which data source knowledge came from.
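
This README does not describe how the `{{NAME}}` placeholders in the template are filled in. As one possible approach, the sketch below substitutes them from a dictionary and fails loudly when a value is missing. The `render_template` helper and the example values are hypothetical; only the template content is taken from the snippet above.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Same entry as the README example.
TEMPLATE = """
{
  "mcpServers": {
    "dynatrace": {
      "command": "npx",
      "args": ["-y", "@dynatrace-oss/dynatrace-mcp-server@latest"],
      "env": {
        "DT_URL": "{{DT_URL}}",
        "DT_TOKEN": "{{DT_TOKEN}}"
      }
    }
  }
}
"""


def render_template(template_text: str, values: dict) -> dict:
    """Fill {{NAME}} placeholders and return the parsed configuration."""
    needed = set(PLACEHOLDER.findall(template_text))
    missing = sorted(needed - set(values))
    if missing:
        raise KeyError("No value provided for: " + ", ".join(missing))

    def substitute(match):
        # json.dumps escapes quotes and backslashes; [1:-1] drops the outer
        # quotes because the template already supplies them.
        return json.dumps(values[match.group(1)])[1:-1]

    return json.loads(PLACEHOLDER.sub(substitute, template_text))


# Placeholder values for illustration only; never hard-code real tokens.
config = render_template(
    TEMPLATE,
    {"DT_URL": "https://example.invalid", "DT_TOKEN": "not-a-real-token"},
)
print(config["mcpServers"]["dynatrace"]["env"]["DT_URL"])
```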

Running Scripts

Legacy standalone analysis scripts can be run from the repo root. Note: some older scripts may still reference the previous databricks_mcp package, which has been renamed to investigation_mcp. See the scripts directory for details.

Troubleshooting

"MCP server not showing in Cursor"

Run ./setup.sh --test-only to validate configuration
Restart Cursor completely (quit and reopen), or try Cmd+Shift+P > "Reload Window"
Check .cursor/mcp.json exists and has correct paths

"Setup fails at connection test"

Verify your Databricks workspace URL includes https://
Check that your personal access token hasn't expired
Ensure your SQL Warehouse is running
Run ./setup.sh --no-test to skip validation, fix the credentials in .env, then run ./setup.sh --test-only

"Table not found"

Check the full table path (catalog.schema.table)
Use load_catalog_index to see what the team knows about
Verify you have access to the table in Databricks
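
A quick way to catch an incomplete table path before running a query is to check for the three-part form. The sketch below is a minimal, hypothetical helper. It only accepts unquoted identifiers, and the example names are made up.

```python
import re

# Unquoted identifier parts only; backtick-quoted names are out of scope here.
_PART = r"[A-Za-z_][A-Za-z0-9_]*"
FULL_TABLE_PATH = re.compile(rf"({_PART})\.({_PART})\.({_PART})")


def parse_table_path(path: str):
    """Split 'catalog.schema.table' into its parts or raise ValueError."""
    match = FULL_TABLE_PATH.fullmatch(path.strip())
    if match is None:
        raise ValueError(
            f"{path!r} is not a full table path; expected catalog.schema.table"
        )
    return match.groups()


print(parse_table_path("main.billing.invoices"))  # ('main', 'billing', 'invoices')

try:
    parse_table_path("billing.invoices")
except ValueError as err:
    print(err)
```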

Security Notes

Databricks tokens stored in .env (gitignored, never committed)
Local catalog entries are gitignored (personal to each user)
Community catalog entries go through PR review before merging
CI scans catalog PRs for secrets and PII patterns
PM guardrails rule prevents AI from modifying protected files (rules, skills, curated catalog) -- exception: apply_proposal during approved learning reviews
Agent memory files are gitignored (personal behavioral learnings never pushed to remote)
Memory promotions to rule files require explicit human approval; platform-scope changes go through PR review
Curator authorization enforced programmatically via _curators.yaml
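
The patterns the CI workflow actually scans for are not listed in this README. As an illustration of the idea, the sketch below flags lines that look like a Databricks personal access token or an email address. Both regular expressions are assumptions (the token pattern assumes the common `dapi` prefix followed by 32 hexadecimal characters), and `scan_text` is a hypothetical helper.

```python
import re

# Assumed patterns, not the ones used by catalog-validation.yml.
PATTERNS = {
    "databricks_token": re.compile(r"\bdapi[0-9a-f]{32}\b"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scan_text(text: str):
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


# Fake values built for the example; the token is not a real credential.
sample = "\n".join([
    "table: main.billing.invoices",
    "token: dapi" + "0123456789abcdef" * 2,
    "owner: someone@example.com",
])

for line_no, name in scan_text(sample):
    print(f"line {line_no}: possible {name}")
```

A check like this would typically fail the PR when `scan_text` returns anything, leaving the contributor to remove the value before the entry can be merged.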
