cuad-audit



An MCP server that audits a contract liability clause against a derived company standard. It produces a verdict only when it can point at the exact retrieved evidence it relied on and that verdict has passed a faithfulness check against that evidence. When the evidence is too weak, it says so: insufficient-grounding is a first-class result, not an error.

This is a study in when an agent should abstain, built as a small, fully-tested MCP server over a real legal-contracts dataset (CUAD, CC BY 4.0).

Quickstart

uv sync
make demo

make demo runs two pinned fixtures through audit_clause and prints the raw tool JSON — no MCP wiring, no full dataset download (the Chroma index auto-builds in seconds from a committed slice of the data). Without ANTHROPIC_API_KEY set, it runs the abstain case only (gate 1 needs no LLM) and tells you so:

{
  "verdict": "insufficient-grounding",
  "reason": "retrieval evidence too weak to ground a verdict (top similarity 0.423 < threshold 0.64) — escalate to a human reviewer",
  "citation": null,
  "gate1_score": 0.4228,
  "gate1_threshold": 0.64,
  "faithfulness": null,
  "failure_cause": "gate1"
}

With a key set, it also runs a cited-verdict case (a mutual 3x work-order liability cap) and prints one of acceptable, risky, or off-standard, together with a chunk_id citation that the server resolved to an exact precedent span.
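
A caller consuming this output might branch on the verdict field roughly as follows. This is a minimal sketch that assumes only the field names visible in the JSON above; the `route` helper and the labels it returns are hypothetical and not part of the server.

```python
import json

# Verdict strings are taken from this README; the routing labels are illustrative.
GROUNDED = {"acceptable", "risky", "off-standard"}


def route(raw: str) -> str:
    """Decide what a caller does with one audit_clause result (hypothetical helper)."""
    result = json.loads(raw)
    verdict = result["verdict"]
    if verdict in GROUNDED:
        # A grounded verdict carries a citation the caller can display.
        return f"report:{verdict}"
    if verdict == "insufficient-grounding":
        # An abstention is a result, not an error: hand it to a person.
        return "human-review"
    if verdict == "escalate-infra":
        # API trouble is reported separately from weak evidence.
        return "infra-escalation"
    return "unknown"


sample = """{
  "verdict": "insufficient-grounding",
  "citation": null,
  "gate1_score": 0.4228,
  "gate1_threshold": 0.64,
  "failure_cause": "gate1"
}"""
print(route(sample))
```

For the abstain fixture shown above, this prints human-review; nothing in the sketch retries the call in the hope of a different verdict.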

Architecture

            ┌──────────────────┐
   agent ──▶│   MCP server     │  stdio; logging to stderr only
            │   (server.py)    │  (stdout is reserved for the protocol)
            └──┬──────┬─────┬──┘
   search_clauses  get_standard  audit_clause (reuses both)
        │              │              │
        ▼              ▼              ▼
   ┌──────────┐  ┌───────────┐  ┌─────────────────────────────┐
   │BM25 +    │  │standard,   │  │ GATE 1 (pre-LLM): cosine    │
   │cosine    │  │derived from│  │   evidence score < 0.64     │
   │(Chroma)  │  │15 read     │  │   → insufficient-grounding  │
   │          │  │clauses     │  │ verdict LLM (Sonnet, t=0)   │
   └────┬─────┘  └───────────┘  │ GATE 2 (post-LLM): Haiku     │
        │                       │   faithfulness judge         │
        ▼                       └─────────────────────────────┘
   CUAD liability spans (Cap On Liability / Uncapped Liability)
   → chunked with stable chunk_ids
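
The stderr-only rule in the diagram matters because a stdio MCP server exchanges protocol messages on stdout, so a stray log line there would corrupt the stream. A minimal sketch of stderr-only logging with Python's standard library follows; the logger name and format string are assumptions, not taken from server.py.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Send all log records to stderr so stdout stays free for protocol messages."""
    logger = logging.getLogger("cuad_audit_sketch")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Do not bubble up to the root logger, which might be configured to write elsewhere.
    logger.propagate = False
    return logger


log = configure_logging()
log.info("index ready")  # written to stderr, never stdout
```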

The three tools

  • search_clauses(query, clause_type="liability", k=5): BM25-ranked precedent chunks (each with a stable chunk_id, source contract, char span, and score), gated by a cosine evidence-confidence score. A query that scores below the threshold returns status: "below_threshold" along with the scores. That is an abstention, not an error.
  • get_standard(clause_type="liability"): the liability "playbook", six positions (P1–P6, e.g. mutuality, cap basis, carve-outs) derived from 15 hand-read CUAD clauses, with provenance. Explicitly scoped: not legal advice, not corpus-wide extraction.
  • audit_clause(incoming_clause, clause_type="liability"): reuses both tools above, then runs the two-gate grounding contract below.
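
The rank-then-gate behaviour described for search_clauses can be sketched as below. This is an illustration under stated assumptions, not the code in audit.py: the `Hit` type, the `search_sketch` function, and every field name other than status: "below_threshold" are hypothetical, and the BM25 and cosine scores are passed in rather than computed.

```python
from dataclasses import asdict, dataclass


@dataclass
class Hit:
    chunk_id: str
    source: str
    span: tuple[int, int]  # character offsets in the source contract
    bm25: float


def search_sketch(hits: list[Hit], cosine_top: float,
                  threshold: float = 0.64, k: int = 5) -> dict:
    """Rank by keyword score, but answer only if the cosine evidence score clears the gate."""
    if cosine_top < threshold:
        # Abstain with the numbers attached so the caller can see why.
        return {"status": "below_threshold", "score": cosine_top,
                "threshold": threshold, "results": []}
    ranked = sorted(hits, key=lambda h: h.bm25, reverse=True)[:k]
    return {"status": "ok", "score": cosine_top,
            "threshold": threshold, "results": [asdict(h) for h in ranked]}


hits = [
    Hit("c-001", "contract-a", (120, 480), 7.2),
    Hit("c-002", "contract-b", (0, 310), 9.1),
]
weak = search_sketch(hits, cosine_top=0.42)
strong = search_sketch(hits, cosine_top=0.71)
print(weak["status"], [r["chunk_id"] for r in strong["results"]])
```

Ranking and gating use different signals here on purpose: the keyword score decides the order of results, while the cosine score only decides whether any results are returned at all.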

The grounding contract (two gates, both in code)

  1. Gate 1 — pre-LLM evidence gate. A leave-one-out cosine similarity score is checked before any LLM call. If the best match is below 0.64 (calibrated against a 13-query negative set — gibberish, out-of-domain clauses, cross-referenced caps), the tool returns insufficient-grounding immediately. No API key needed for this path.
  2. Gate 2 — post-LLM faithfulness judge. A single cheap Haiku call decomposes the verdict's reasoning into claims and checks each is supported by the cited chunk and the standard. An unfaithful verdict is downgraded to insufficient-grounding and counted as a hallucination — never silently shipped.

Citations are chunk_id lookups, not string matching: the verdict LLM picks an id from the evidence it was shown, and the server resolves it to the exact span. If the LLM names an invalid verdict or a chunk_id outside the retrieved evidence, that's also caught and downgraded to insufficient-grounding.

escalate-infra (API timeout/rate-limit/malformed output/refusal) is a separate verdict from insufficient-grounding — "the system is honest" and "the API is flaky" are never conflated.
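
The control flow described in this section can be sketched as follows. It is a simplified illustration, not the project's implementation: `audit_sketch`, `Chunk`, `InfraError`, and the failure_cause values other than "gate1" are hypothetical names, the two LLM calls are injected as plain callables, and the judge here sees only the verdict and the cited chunk rather than the full reasoning and standard.

```python
from dataclasses import dataclass
from typing import Callable

VALID = {"acceptable", "risky", "off-standard"}


class InfraError(Exception):
    """Stand-in for a timeout, rate limit, malformed output, or refusal."""


@dataclass
class Chunk:
    chunk_id: str
    text: str


def audit_sketch(
    clause: str,
    evidence: list[Chunk],
    gate1_score: float,
    verdict_llm: Callable[[str, list[Chunk]], tuple[str, str]],
    judge: Callable[[str, Chunk], bool],
    threshold: float = 0.64,
) -> dict:
    def abstain(cause: str) -> dict:
        return {"verdict": "insufficient-grounding", "citation": None, "failure_cause": cause}

    # Gate 1: weak retrieval evidence means no LLM call at all.
    if gate1_score < threshold:
        return abstain("gate1")
    try:
        verdict, cited_id = verdict_llm(clause, evidence)
        by_id = {c.chunk_id: c for c in evidence}
        # The citation is an id lookup over the evidence that was actually shown.
        if verdict not in VALID or cited_id not in by_id:
            return abstain("invalid_output")
        # Gate 2: the judge must accept the verdict against the cited chunk.
        if not judge(verdict, by_id[cited_id]):
            return abstain("gate2")
    except InfraError:
        # Infrastructure trouble is never reported as a grounding abstention.
        return {"verdict": "escalate-infra", "citation": None, "failure_cause": "infra"}
    return {"verdict": verdict, "citation": cited_id, "failure_cause": None}


def flaky(clause: str, evidence: list[Chunk]) -> tuple[str, str]:
    raise InfraError("timeout")


evidence = [Chunk("c-002", "Liability is capped at three times fees under the work order.")]
accept_all = lambda verdict, chunk: True
ok = audit_sketch("cap clause", evidence, 0.71, lambda c, e: ("acceptable", "c-002"), accept_all)
bad_cite = audit_sketch("cap clause", evidence, 0.71, lambda c, e: ("acceptable", "c-999"), accept_all)
infra = audit_sketch("cap clause", evidence, 0.71, flaky, accept_all)
print(ok["verdict"], bad_cite["verdict"], infra["verdict"])
```

Injecting the two LLM calls as callables is what lets every branch, including the failure paths, be exercised without an API key.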

Connect to Claude Code

claude mcp add cuad-audit -- uv run --directory /path/to/luminance python -m cuad_audit.server
claude mcp list   # health check — should show cuad-audit as connected

Then ask Claude Code to call search_clauses, get_standard, or audit_clause. Tool descriptions document the abstain semantics — agents should relay insufficient-grounding / below_threshold / escalate-infra verbatim, not retry until they get a verdict.

| Symptom | Likely cause | Fix |
|---|---|---|
| Server doesn't appear in claude mcp list | wrong --directory or uv not on PATH | run uv run python -m cuad_audit.server directly from the repo root and check stderr |
| audit_clause always returns escalate-infra | ANTHROPIC_API_KEY not set | export ANTHROPIC_API_KEY=... (search_clauses/get_standard still work without it) |
| First call is slow / looks hung | first-run embedding model download (~90 MB) | progress prints to stderr; subsequent runs are cached |

Setup

  • Python 3.11+, dependency management via uv with a committed lockfile (uv sync).
  • Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~90 MB, downloaded once and cached).
  • Tested on macOS and Linux.
  • First make demo / make ingest: ~1–2 minutes (model download + index build from the committed slice of 680 chunks). Subsequent runs: seconds. Measured smoke test (uv sync && make demo && make eval-retrieval && make test, warm model cache): 97s total, 33 passed + 1 skipped (the skipped test needs ANTHROPIC_API_KEY).
  • ANTHROPIC_API_KEY — only required for audit_clause's verdict + judge calls and make eval-verdicts. Not required for search_clauses, get_standard, the abstain half of make demo, make eval-retrieval, or make calibrate.

Reproducing the measurements

make eval-retrieval   # keyless, deterministic — retrieval vs CUAD spans + kill criteria
make calibrate        # keyless — gate-1 threshold calibration distributions
make eval-verdicts    # needs ANTHROPIC_API_KEY — ~70 LLM calls, a few dollars

Retrieval vs CUAD expert spans (held-out split, docs/DAY2_RESULTS.md)

166 held-out queries against 680 library chunks (80% Cap On Liability / 20% Uncapped Liability):

| Metric | BM25 | MiniLM (cosine) |
|---|---|---|
| precision@1 (overall) | 0.741 | 0.705 |
| precision@1 (Uncapped Liability, n=34) | 0.559 | 0.500 |
| success@3 (≥1 relevant in top 3) | 0.970 | 0.970 |

Embeddings did not beat the keyword baseline, so retrieval ranking and citations are BM25-first (pre-registered Day-2 rule); the semantic index stays in the repo, tested, as the measured comparison and feeds gate 1.
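
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how they can be computed. It assumes relevance is given as a set of relevant chunk ids per query; how eval_retrieval.py actually maps CUAD spans to relevance is not shown here, and the function names and toy data are illustrative.

```python
def precision_at_1(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries whose top-ranked chunk is relevant."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if ranked and ranked[0] in rel)
    return hits / len(rankings)


def success_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 3) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant)
               if any(chunk in rel for chunk in ranked[:k]))
    return hits / len(rankings)


# Toy data: four queries, each with a ranked list of chunk ids and a relevant set.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = [{"a"}, {"f"}, {"x"}, {"j", "k"}]
print(precision_at_1(rankings, relevant), success_at_k(rankings, relevant))
```

On the toy data this prints 0.5 0.75: two of four queries have a relevant chunk at rank 1, and three of four have one somewhere in the top 3.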

Gate-1 threshold (docs/DAY3_CALIBRATION.md)

Calibrated on leave-one-out cosine top-scores: library positives (n=669, p10=0.64) vs a 13-query negative set (gibberish, out-of-domain clauses, cross-referenced caps). Threshold 0.64 catches 8/13 negatives outright; the remaining 5 (cross-referenced caps, near-domain insurance/audit-rights text) are real liability-adjacent text with no auditable content — owned by the verdict path (standard position P5), not by gate 1. See docs/gate1_calibration.png.
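
The calibration idea can be sketched in a few lines: score each library vector against every other one, take a low percentile of those top scores as the threshold, and count how many negatives fall below it. This is a toy illustration with 2-D vectors, not calibrate.py; the function names are hypothetical and the nearest-rank percentile is an assumption about how p10 is computed.

```python
import math


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def loo_top_scores(vectors: list[tuple[float, ...]]) -> list[float]:
    """For each vector, the best cosine similarity to any other vector in the library."""
    return [
        max(cosine(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]


def percentile_nearest_rank(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def caught(negative_scores: list[float], threshold: float) -> int:
    """Number of negative queries that the gate would reject outright."""
    return sum(1 for s in negative_scores if s < threshold)


library = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
positives = loo_top_scores(library)
threshold = percentile_nearest_rank(positives, 10)
print(round(threshold, 3), caught([0.2, 0.5, 0.7], 0.64))
```

Picking a low percentile of the positive distribution means the gate rejects few true positives while still rejecting negatives that score well below typical in-library similarity.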

Verdicts vs hand-labeled standard (docs/EVAL.md, results: docs/DAY5_RESULTS.md)

35-item eval set (28 hand labels + 3 cross-reference cases + 3 capped/uncapped confusion pairs + 1 prompt-injection probe), run via make eval-verdicts. Reported as counts and a failure taxonomy, never a headline accuracy — n is too small for that, and the report says so. Columns distinguish grounding abstentions (justified vs unjustified) from infra abstentions (API failures).

One live run (2026-06-10; ±1–2 expected on re-run):

| Outcome | n | % of 28 |
|---|---|---|
| Non-abstained verdict | 11 | 39% |
| Grounding abstain: justified (R9) | 1 | 4% |
| Grounding abstain: unjustified | 16 | 57% |
| Infra abstain | 0 | 0% |

6/11 non-abstained verdicts matched the hand label exactly. Of the 16 unjustified abstentions, 14 came from the gate-2 faithfulness judge — a hand-verified sample found the judge, not the verdict LLM, was usually the weak link (rejecting reasonable inferential claims as "unsupported"). Citations were valid on 100% of non-abstained verdicts; adversarial defenses (cross-referenced caps + prompt-injection probe) held 4/4. Full taxonomy, root-cause analysis, and "what I'd do next": docs/DAY5_RESULTS.md. One-page project writeup: docs/WRITEUP.md.

Known limitations (named on purpose)

  • Polarity risk: "Cap on Liability" and "Uncapped Liability" are a negation pair that embedding similarity can confuse. Gate 1 measures evidence strength, not correctness — the verdict LLM owns the capped/uncapped call, and confusion pairs are in the adversarial eval set.
  • Tool-level vs agent-level grounding: the server cannot stop a client agent from speculating after an insufficient-grounding result. The demo harness instructs verbatim relay and shows raw tool output.
  • Single lane (liability), single segmenter (CUAD's expert spans, not a production clause segmenter), standard derived from 15 read clauses — all scoped claims, not corpus-wide extraction. See PLAN.md for the full "Not in Scope" list and rationale.
  • The faithfulness judge (gate 2) is itself an ungated LLM call; a hand-verified sample of judge outputs is reported alongside the eval.

Project layout

src/cuad_audit/
  download.py       CUAD v1 download (pinned sha256)
  derive_slice.py   reproduces the committed data slice byte-identically
  ingest.py         chunking, token-length checks, Chroma index build
  retrieval.py      BM25 (KeywordIndex) + cosine (SemanticIndex)
  llm.py            CompleteFn seam — typed failures, no silent fallbacks
  audit.py          the three tools + both gates
  server.py         MCP stdio entrypoint
  demo.py           make demo
  calibrate.py      gate-1 threshold calibration
  eval_retrieval.py make eval-retrieval
  eval_verdicts.py  make eval-verdicts (resumable JSONL)
data/               committed slice (split, standard, labels, chunks)
docs/               split methodology, rubric, eval definitions, results
tests/              34 tests, LLM seam fully mocked — CI is free

Data & attribution

Built on the Contract Understanding Atticus Dataset (CUAD) v1, © The Atticus Project, licensed under CC BY 4.0. This repo commits a small derived slice (liability-clause spans, data/liability_spans_all.json and data/split.json) for reproducibility; make ingest can re-derive the index from a fresh download via src/cuad_audit/download.py.

Code is MIT licensed.
