cuad-audit
An MCP server that audits a contract liability clause against a derived company standard, producing a verdict only when grounded in retrieved evidence and passing a faithfulness check. It abstains with 'insufficient-grounding' when evidence is too weak.
An MCP server that audits a contract liability clause against a derived
company standard, and only produces a verdict when it can point at the
exact retrieved evidence it relied on, and that verdict has passed a
faithfulness check against that evidence. When the evidence is too weak,
it says so — insufficient-grounding is a first-class result, not an error.
This is a study in when an agent should abstain, built as a small, fully-tested MCP server over a real legal-contracts dataset (CUAD, CC BY 4.0).
Quickstart
uv sync
make demo
make demo runs two pinned fixtures through audit_clause and prints the
raw tool JSON — no MCP wiring, no full dataset download (the Chroma index
auto-builds in seconds from a committed slice of the data). Without
ANTHROPIC_API_KEY set, it runs the abstain case only (gate 1 needs no
LLM) and tells you so:
{
"verdict": "insufficient-grounding",
"reason": "retrieval evidence too weak to ground a verdict (top similarity 0.423 < threshold 0.64) — escalate to a human reviewer",
"citation": null,
"gate1_score": 0.4228,
"gate1_threshold": 0.64,
"faithfulness": null,
"failure_cause": "gate1"
}
With a key set, it also runs a cited verdict case (a mutual 3x
work-order liability cap) and prints acceptable/risky/off-standard
with a chunk_id citation the server resolved to an exact precedent span.
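A caller can branch on the verdict field instead of treating an abstention as an exception. The sketch below is a minimal illustration, assuming the tool JSON carries the fields printed above; route_result and the action strings it returns are hypothetical names, not part of the server.

```python
import json

# Verdict values named in this README; the routing labels are illustrative only.
CITED_VERDICTS = {"acceptable", "risky", "off-standard"}


def route_result(raw_json: str) -> str:
    """Decide what a caller might do with one audit_clause result."""
    result = json.loads(raw_json)
    verdict = result.get("verdict")
    if verdict == "insufficient-grounding":
        # Abstention is a normal result: hand off, do not retry for a verdict.
        return "escalate-to-human"
    if verdict == "escalate-infra":
        # API trouble, not a statement about the clause.
        return "report-infra-failure"
    if verdict in CITED_VERDICTS and result.get("citation") is not None:
        return "show-cited-verdict"
    # Anything else falls outside the documented contract.
    return "escalate-to-human"


abstain = json.dumps({
    "verdict": "insufficient-grounding",
    "citation": None,
    "gate1_score": 0.4228,
    "gate1_threshold": 0.64,
    "faithfulness": None,
    "failure_cause": "gate1",
})
print(route_result(abstain))  # escalate-to-human
```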
Architecture
┌──────────────────┐
agent ──▶│ MCP server │ stdio; logging to stderr only
│ (server.py) │ (stdout is reserved for the protocol)
└──┬──────┬─────┬──┘
search_clauses get_standard audit_clause (reuses both)
│ │ │
▼ ▼ ▼
┌──────────┐ ┌───────────┐ ┌─────────────────────────────┐
│BM25 + │ │standard, │ │ GATE 1 (pre-LLM): cosine │
│cosine │ │derived from│ │ evidence score < 0.64 │
│(Chroma) │ │15 read │ │ → insufficient-grounding │
│ │ │clauses │ │ verdict LLM (Sonnet, t=0) │
└────┬─────┘ └───────────┘ │ GATE 2 (post-LLM): Haiku │
│ │ faithfulness judge │
▼ └─────────────────────────────┘
CUAD liability spans (Cap On Liability / Uncapped Liability)
→ chunked with stable chunk_ids
The three tools
- search_clauses(query, clause_type="liability", k=5): BM25-ranked precedent chunks (each with a stable chunk_id, source contract, char span, and score), gated by a cosine evidence-confidence score. A score below the threshold returns status: "below_threshold" with the scores, which is an abstention, not an error.
- get_standard(clause_type="liability"): the liability "playbook", six positions (P1–P6, e.g. mutuality, cap basis, carve-outs) derived from 15 hand-read CUAD clauses, with provenance. Explicitly scoped: not legal advice, not corpus-wide extraction.
- audit_clause(incoming_clause, clause_type="liability"): reuses both tools above, then runs the two-gate grounding contract below.
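The gating behavior of search_clauses can be sketched as follows. This is an illustration under stated assumptions, not the code in audit.py: the Hit dataclass, the gate_search helper, the "ok" status, and the response field names other than "below_threshold" are hypothetical, and the BM25 scores and cosine top score are taken as already computed.

```python
from dataclasses import asdict, dataclass


@dataclass
class Hit:
    chunk_id: str          # stable id of the precedent chunk
    contract: str          # source contract name
    span: tuple[int, int]  # character offsets in the source contract
    score: float           # BM25 score


def gate_search(hits: list[Hit], cosine_top: float,
                threshold: float = 0.64, k: int = 5) -> dict:
    """Return BM25-ranked hits only when the cosine evidence score clears the gate."""
    if cosine_top < threshold:
        # Abstention, not an error: report the scores and stop.
        return {"status": "below_threshold",
                "cosine_top": cosine_top,
                "threshold": threshold}
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:k]
    return {"status": "ok", "results": [asdict(h) for h in ranked]}


hits = [Hit("c1", "contract-A", (0, 120), 1.2),
        Hit("c2", "contract-B", (40, 310), 3.4)]
print(gate_search(hits, cosine_top=0.42)["status"])  # below_threshold
print(gate_search(hits, cosine_top=0.80)["results"][0]["chunk_id"])  # c2
```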
The grounding contract (two gates, both in code)
- Gate 1: pre-LLM evidence gate. A leave-one-out cosine similarity score is checked before any LLM call. If the best match is below 0.64 (calibrated against a 13-query negative set: gibberish, out-of-domain clauses, cross-referenced caps), the tool returns insufficient-grounding immediately. No API key is needed for this path.
- Gate 2: post-LLM faithfulness judge. A single cheap Haiku call decomposes the verdict's reasoning into claims and checks that each is supported by the cited chunk and the standard. An unfaithful verdict is downgraded to insufficient-grounding and counted as a hallucination; it is never silently shipped.
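The order of the two gates can be sketched with the LLM calls injected as plain callables. This is a minimal illustration, assuming the gate-1 score is already computed; audit_sketch, its parameters, and the "gate2" failure_cause value are hypothetical (the demo output above only shows "gate1").

```python
from typing import Callable

ABSTAIN = "insufficient-grounding"


def audit_sketch(
    clause: str,
    gate1_score: float,
    verdict_llm: Callable[[str], dict],
    judge: Callable[[dict], bool],
    threshold: float = 0.64,
) -> dict:
    """Run gate 1, then the verdict call, then gate 2, in that order."""
    # Gate 1 runs before any LLM call, so this path needs no API key.
    if gate1_score < threshold:
        return {"verdict": ABSTAIN, "failure_cause": "gate1"}
    draft = verdict_llm(clause)
    # Gate 2: an unfaithful draft is downgraded rather than returned.
    if not judge(draft):
        return {"verdict": ABSTAIN, "failure_cause": "gate2"}
    return draft


def must_not_run(_clause: str) -> dict:
    raise AssertionError("verdict LLM called despite weak evidence")


# Weak evidence short-circuits before the (stubbed) LLM is touched.
print(audit_sketch("lorem ipsum", 0.42, must_not_run, lambda d: True))
```

Passing the model calls in as arguments keeps the gate ordering testable without network access.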
Citations are chunk_id lookups, not string matching: the verdict LLM
picks an id from the evidence it was shown, and the server resolves it to
the exact span. If the LLM names an invalid verdict or a chunk_id outside
the retrieved evidence, that's also caught and downgraded to
insufficient-grounding.
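Resolving a citation by id rather than by string matching might look like the sketch below. It assumes the retrieved evidence is held as a mapping from chunk_id to chunk metadata; resolve_citation and the metadata keys are hypothetical names chosen for illustration.

```python
VALID_VERDICTS = {"acceptable", "risky", "off-standard"}


def resolve_citation(llm_output: dict, evidence: dict[str, dict]) -> dict:
    """Accept a verdict only if it cites a chunk the LLM was actually shown."""
    verdict = llm_output.get("verdict")
    chunk_id = llm_output.get("chunk_id")
    cited_ok = isinstance(chunk_id, str) and chunk_id in evidence
    if verdict not in VALID_VERDICTS or not cited_ok:
        # Invalid verdict name or out-of-evidence id: downgrade, do not guess.
        return {"verdict": "insufficient-grounding", "citation": None}
    chunk = evidence[chunk_id]
    return {
        "verdict": verdict,
        "citation": {"chunk_id": chunk_id,
                     "contract": chunk["contract"],
                     "span": chunk["span"]},
    }


evidence = {"c7": {"contract": "contract-A", "span": (1024, 1490)}}
print(resolve_citation({"verdict": "risky", "chunk_id": "c7"}, evidence))
print(resolve_citation({"verdict": "risky", "chunk_id": "c99"}, evidence))
```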
escalate-infra (API timeout/rate-limit/malformed output/refusal) is a
separate verdict from insufficient-grounding — "the system is honest"
and "the API is flaky" are never conflated.
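One way to keep the two outcomes apart is to give API-side failures their own exception types and map only those to escalate-infra. The sketch below is an assumption about how such a seam could look; the exception classes and guarded_call are hypothetical and are not the contents of llm.py.

```python
from typing import Callable


class InfraError(Exception):
    """Hypothetical base class: the API failed, the evidence was not judged."""


class ApiTimeout(InfraError):
    pass


class RateLimited(InfraError):
    pass


class MalformedOutput(InfraError):
    pass


class Refusal(InfraError):
    pass


def guarded_call(complete: Callable[[str], dict], prompt: str) -> dict:
    """Map typed infra failures to escalate-infra; let everything else propagate."""
    try:
        return complete(prompt)
    except InfraError as exc:
        # Never reported as insufficient-grounding: the clause was not assessed.
        return {"verdict": "escalate-infra", "failure_cause": type(exc).__name__}


def rate_limited(_prompt: str) -> dict:
    raise RateLimited("429 from the API")


print(guarded_call(rate_limited, "audit this clause"))
```

Unrelated exceptions are deliberately not caught, so a programming error surfaces as a crash instead of being relabeled as an abstention.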
Connect to Claude Code
claude mcp add cuad-audit -- uv run --directory /path/to/luminance python -m cuad_audit.server
claude mcp list # health check — should show cuad-audit as connected
Then ask Claude Code to call search_clauses, get_standard, or
audit_clause. Tool descriptions document the abstain semantics — agents
should relay insufficient-grounding / below_threshold / escalate-infra
verbatim, not retry until they get a verdict.
| Symptom | Likely cause | Fix |
|---|---|---|
| Server doesn't appear in claude mcp list | wrong --directory or uv not on PATH | run uv run python -m cuad_audit.server directly from the repo root and check stderr |
| audit_clause always returns escalate-infra | ANTHROPIC_API_KEY not set | export ANTHROPIC_API_KEY=... (search_clauses/get_standard still work without it) |
| First call is slow / looks hung | first-run embedding model download (~90 MB) | progress prints to stderr; subsequent runs are cached |
Setup
- Python 3.11+, dependency management via uv with a committed lockfile (uv sync).
- Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~90 MB, downloaded once and cached).
- Tested on macOS and Linux.
- First make demo / make ingest: ~1–2 minutes (model download + index build from the committed slice of 680 chunks). Subsequent runs: seconds. Measured smoke test (uv sync && make demo && make eval-retrieval && make test, warm model cache): 97s total, 33 passed + 1 skipped (the skipped test needs ANTHROPIC_API_KEY).
- ANTHROPIC_API_KEY: only required for audit_clause's verdict + judge calls and make eval-verdicts. Not required for search_clauses, get_standard, the abstain half of make demo, make eval-retrieval, or make calibrate.
Reproducing the measurements
make eval-retrieval # keyless, deterministic — retrieval vs CUAD spans + kill criteria
make calibrate # keyless — gate-1 threshold calibration distributions
make eval-verdicts # needs ANTHROPIC_API_KEY — ~70 LLM calls, a few dollars
Retrieval vs CUAD expert spans (held-out split, docs/DAY2_RESULTS.md)
166 held-out queries against 680 library chunks (80% Cap On Liability / 20% Uncapped Liability):
| Metric | BM25 | MiniLM (cosine) |
|---|---|---|
| precision@1 (overall) | 0.741 | 0.705 |
| precision@1 (Uncapped Liability, n=34) | 0.559 | 0.500 |
| success@3 (≥1 relevant in top 3) | 0.970 | 0.970 |
Embeddings did not beat the keyword baseline, so retrieval ranking and citations are BM25-first (pre-registered Day-2 rule); the semantic index stays in the repo, tested, as the measured comparison and feeds gate 1.
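The two metrics in the table can be computed as sketched below. This assumes each query has a ranked list of chunk ids and a set of relevant chunk ids; how the repo defines relevance against CUAD spans is not shown here, and the function names and toy data are illustrative.

```python
def precision_at_1(rankings: dict[str, list[str]],
                   relevant: dict[str, set[str]]) -> float:
    """Fraction of queries whose top-ranked chunk is relevant."""
    hits = sum(1 for q, ranked in rankings.items()
               if ranked and ranked[0] in relevant[q])
    return hits / len(rankings)


def success_at_k(rankings: dict[str, list[str]],
                 relevant: dict[str, set[str]], k: int = 3) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for q, ranked in rankings.items()
               if any(c in relevant[q] for c in ranked[:k]))
    return hits / len(rankings)


# Toy data, not taken from the evaluation.
rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"],
            "q3": ["x", "y", "z", "w"], "q4": ["m"]}
relevant = {"q1": {"a"}, "q2": {"c"}, "q3": {"w"}, "q4": {"m"}}

print(precision_at_1(rankings, relevant))     # 0.5
print(success_at_k(rankings, relevant, k=3))  # 0.75
```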
Gate-1 threshold (docs/DAY3_CALIBRATION.md)
Calibrated on leave-one-out cosine top-scores: library positives (n=669, p10=0.64) vs a 13-query negative set (gibberish, out-of-domain clauses, cross-referenced caps). Threshold 0.64 catches 8/13 negatives outright; the remaining 5 (cross-referenced caps, near-domain insurance/audit-rights text) are real liability-adjacent text with no auditable content — owned by the verdict path (standard position P5), not by gate 1. See docs/gate1_calibration.png.
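The calibration idea can be sketched in plain Python: score each library chunk against every other chunk, take a low percentile of those top scores as the threshold, and count how many negative queries fall below it. This is an illustration under assumptions: the nearest-rank percentile, the helper names, and the toy vectors are not taken from calibrate.py.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def leave_one_out_top(vectors: list[list[float]]) -> list[float]:
    """For each vector, the best cosine match among all the other vectors."""
    return [max(cosine(v, w) for j, w in enumerate(vectors) if j != i)
            for i, v in enumerate(vectors)]


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def caught(negative_scores: list[float], threshold: float) -> int:
    """Number of negative queries that gate 1 would reject outright."""
    return sum(1 for s in negative_scores if s < threshold)


# Toy vectors standing in for chunk embeddings.
print(leave_one_out_top([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))  # [1.0, 1.0, 0.0]
print(caught([0.42, 0.70, 0.30], threshold=0.64))  # 2
```

Excluding each chunk from its own comparison matters: without it every positive would score 1.0 against itself and the distribution would say nothing about the threshold.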
Verdicts vs hand-labeled standard (docs/EVAL.md, results: docs/DAY5_RESULTS.md)
35-item eval set (28 hand labels + 3 cross-reference cases + 3
capped/uncapped confusion pairs + 1 prompt-injection probe), run via
make eval-verdicts. Reported as counts and a failure taxonomy, never a
headline accuracy — n is too small for that, and the report says so. Columns
distinguish grounding abstentions (justified vs unjustified) from infra
abstentions (API failures).
One live run (2026-06-10; ±1–2 expected on re-run):
| Outcome | n | % of 28 |
|---|---|---|
| Non-abstained verdict | 11 | 39% |
| Grounding abstain — justified (R9) | 1 | 4% |
| Grounding abstain — unjustified | 16 | 57% |
| Infra abstain | 0 | 0% |
6/11 non-abstained verdicts matched the hand label exactly. Of the 16 unjustified abstentions, 14 came from the gate-2 faithfulness judge — a hand-verified sample found the judge, not the verdict LLM, was usually the weak link (rejecting reasonable inferential claims as "unsupported"). Citations were valid on 100% of non-abstained verdicts; adversarial defenses (cross-referenced caps + prompt-injection probe) held 4/4. Full taxonomy, root-cause analysis, and "what I'd do next": docs/DAY5_RESULTS.md. One-page project writeup: docs/WRITEUP.md.
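The percentage column can be re-derived from the counts as a quick consistency check. The snippet uses only the counts reported in the table above; the variable names are illustrative.

```python
# Counts from the live-run table (the 28 hand-labeled items).
counts = {
    "Non-abstained verdict": 11,
    "Grounding abstain, justified (R9)": 1,
    "Grounding abstain, unjustified": 16,
    "Infra abstain": 0,
}

total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total)  # 28
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```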
Known limitations (named on purpose)
- Polarity risk: "Cap on Liability" and "Uncapped Liability" are a negation pair that embedding similarity can confuse. Gate 1 measures evidence strength, not correctness — the verdict LLM owns the capped/uncapped call, and confusion pairs are in the adversarial eval set.
- Tool-level vs agent-level grounding: the server cannot stop a client agent from speculating after an insufficient-grounding result. The demo harness instructs verbatim relay and shows raw tool output.
- Single lane (liability), single segmenter (CUAD's expert spans, not a production clause segmenter), standard derived from 15 read clauses: all scoped claims, not corpus-wide extraction. See PLAN.md for the full "Not in Scope" list and rationale.
- The faithfulness judge (gate 2) is itself an ungated LLM call; a hand-verified sample of judge outputs is reported alongside the eval.
Project layout
src/cuad_audit/
download.py CUAD v1 download (pinned sha256)
derive_slice.py reproduces the committed data slice byte-identically
ingest.py chunking, token-length checks, Chroma index build
retrieval.py BM25 (KeywordIndex) + cosine (SemanticIndex)
llm.py CompleteFn seam — typed failures, no silent fallbacks
audit.py the three tools + both gates
server.py MCP stdio entrypoint
demo.py make demo
calibrate.py gate-1 threshold calibration
eval_retrieval.py make eval-retrieval
eval_verdicts.py make eval-verdicts (resumable JSONL)
data/ committed slice (split, standard, labels, chunks)
docs/ split methodology, rubric, eval definitions, results
tests/ 34 tests, LLM seam fully mocked — CI is free
Data & attribution
Built on the Contract Understanding Atticus Dataset (CUAD) v1,
© The Atticus Project, licensed under
CC BY 4.0. This repo commits
a small derived slice (liability-clause spans, data/liability_spans_all.json
and data/split.json) for reproducibility; make ingest can re-derive the
index from a fresh download via src/cuad_audit/download.py.
Code is MIT licensed.