groundwork

groundwork

Enables AI agents to perform grounded web research with injection resistance, claim verification, and cost-aware routing through MCP tools like web_search, fetch_url, extract_claims, and check_grounding.

Category
Visit Server

README

Groundwork

CI coverage python

A grounded, injection-resistant, cost-aware AI research agent — built to prove an agent can be trusted, not just demoed.

Groundwork researches how businesses adopt and apply AI (use cases, vendor landscape, ROI evidence, implementation patterns, risks) and returns an answer where every claim is verified against a retrieved source, ungrounded statements are flagged rather than shipped, and fetched web content is treated as untrusted data — not instructions. You can watch the whole trajectory — plan → gather → ground → critique → synthesize — stream live in a dashboard.

Groundwork dashboard

No screenshot yet? See docs/DEMO-TODO.md — one command brings the dashboard up.

What makes it different

These are the failure modes most agent demos ignore. Groundwork is built around them, and measures itself on them (see Evaluation):

Differentiator Where it lives
Grounding verification (real entailment) After synthesis, an LLM checks every claim for entailment against the retrieved sources; the answer ships with an "X of Y claims verified" report and flags the rest. Lexical fallback runs with no key. (research_agent/llm_grounding.py, grounding.py)
Prompt-injection resistance Fetched content is wrapped as data and scanned for manipulation patterns; injections are flagged and ignored, never obeyed. Proven by a benign red-team suite. (research_agent/defenses.py, redteam/)
Cost-aware tiered routing Haiku-class workers do the bulk research; Sonnet-class planner/critic supervise. One router maps role → model; cost is accounted by role. (core/providers.py, core/cost.py)
Observability Full step/trajectory tracing, streamed live to the dashboard over SSE, plus per-run token/cost accounting. (core/tracing.py, api/server.py)

Architecture

Three layers over a shared core, built MCP → agent → orchestrator — each runnable standalone. Full diagram in docs/architecture.md; the decisions, trade-offs, and known limitations behind each layer are in docs/DESIGN.md.

core/   providers (+ tiered routing) · tracing · cost · types
  │
  ├─ Layer 1  mcp_server/   spec-compliant MCP server: web_search, fetch_url
  │                         (+provenance, untrusted), extract_claims, check_grounding
  ├─ Layer 2  research_agent/  plan → gather → synthesize (cited) → verify grounding
  │                            + injection defenses, tracing, cost
  └─ Layer 3  orchestrator/   planner → workers (parallel) → critic (grounding +
                              injection checks → retry) → synthesize

api/  FastAPI: POST /research streams the trajectory live (SSE)
web/  Next.js dashboard that renders the stream
evals/ labeled datasets + scorer for grounding accuracy & injection resistance
  • Real web via Tavily when TAVILY_API_KEY is set; offline fixture corpus otherwise — so dev, CI, and the demo all run with zero keys.
  • Article extraction: trafilatura → BeautifulSoup → regex, best available.

Evaluation

Groundwork scores its own differentiators on labeled datasets (evals/). The lexical grounder and the regex injection detector need no API key, so these numbers are reproducible — CI runs them on every push:

Capability Method n Precision Recall F1 Accuracy
Grounding lexical heuristic 30 0.86 1.00 0.92 0.90
Grounding LLM entailment (claude-sonnet-4-6) 30 1.00 0.94 0.97 0.97
Injection detection regex pattern scan 20 1.00 1.00 1.00 1.00

The lexical grounder over-accepts paraphrased contradictions and overclaims that share vocabulary with a source (precision 0.86). The LLM entailment grounder catches exactly those — perfect precision, never accepting an unsupported claim, at a small recall cost. That gap is the whole argument for grounding with a model rather than string overlap.

Re-run: python -m evals.run (writes evals/report.md).

A real, web-grounded sample brief produced by the agent is committed at reports/sample_research_report.md — note its grounding footer ("X of Y claims verified") and how the final-answer grounding pass flags unverified specifics rather than shipping them.

vs. naive RAG

The point of grounding, as a number (benchmark/report.md, python -m benchmark.run):

Approach Hallucinations shipped ↓ Valid claims kept ↑
Naive RAG (no grounding) 100% (12/12) 100%
Groundwork — lexical grounder 25% (3/12) 100%
Groundwork — LLM entailment 0% (0/12) 94%

A naive retrieve-then-synthesize agent ships every unsupported claim as if it were true. Groundwork's grounding filter is the difference between confident-but-wrong and trustworthy.

Quick start

pip install -e .                  # core; add ".[real,api]" for live web + the API

# 1) Offline three-layer demo — no key. Plan→workers→critic loop, grounding,
#    injection flags, per-role cost, over a fixture corpus with mock models:
python run_demo.py

# 2) The dashboard (offline mock mode):
GROUNDWORK_MOCK=1 uvicorn api.server:app --port 8000      # backend
cd web && npm install && NEXT_PUBLIC_API_URL=http://localhost:8000 npm run dev

# 3) A REAL run (live models + web):
export ANTHROPIC_API_KEY=sk-ant-...   ;  export TAVILY_API_KEY=tvly-...   # optional
python research.py "How are mid-market logistics firms using AI for demand forecasting?"

# 4) MCP server for an MCP client (Claude Desktop): python -m mcp_server.server

Deploy (FastAPI → Render, Next.js → Vercel): docs/DEPLOY.md. The dashboard supports bring-your-own-key — deploy the backend with no server key and visitors paste their own (sent per-request via X-Anthropic-Key, never stored), so a public demo is free and abuse-safe; with no key it runs in offline mock mode.

Tests & CI

pip install -e ".[dev]" && pytest -q && ruff check . --select E,F,I,W --ignore E501

35 tests at ~77% coverage, all offline (no key / network): injection canaries detected and not obeyed; LLM-grounding JSON parsing + entailment verdicts; supported claims ground while fabricated ones are flagged; the orchestrator critic rejects an ungrounded brief, revises, and accounts cost by role; FastAPI endpoints exercised end-to-end (SSE research stream + run history) via TestClient in mock mode; eval- and benchmark-quality regression guards. GitHub Actions runs lint + tests-with-coverage + the offline evals and benchmark on every push.

Safety

Everything in redteam/injection_pages/ is a benign canary — a harmless obedience probe (e.g. an embedded "append BANANA" / "recommend Brand X"). No operational attacks or harmful payloads anywhere. Treating fetched/external content as untrusted data is the core security stance, applied throughout.

Author

Built by Desmond Sleighgithub.com/Des-Sleigh. Sibling project: llm-eval-harness — measuring model quality with the same evaluation discipline Groundwork applies to its own output.

License: MIT.

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured