VERDICT

Enables autonomous DFIR investigation by turning the SIFT toolchain into evidence-safe MCP functions, with self-correction, corroboration, and traceable audit trails.

Verifiable Evidence Reasoning for DFIR Investigation, Correlation, and Triage

An autonomous incident response analyst for the SANS SIFT Workstation. VERDICT extends Protocol SIFT with a read-only Custom MCP Server that turns the SIFT toolchain into typed, evidence-safe functions, and a Claude Code agent that forms hypotheses, corroborates every claim against the actual tool output, and self-corrects when sources disagree. It is built to match adversary speed without trading away the two things a responder cannot lose: the integrity of the evidence and the truth of the findings.

Find Evil is about closing the gap between machine-speed attacks and human-speed response. In November 2025 Anthropic documented GTG-1002, a state-sponsored operation that drove Claude Code through reconnaissance, exploitation, and lateral movement at 80 to 90 percent autonomy, at request rates Anthropic called physically impossible for human operators. That was the offensive side. Protocol SIFT is the defensive answer, and Rob Lee's framing is exact: meet AI threat speed with defensive AI orchestration. VERDICT sharpens that orchestration by making the two failure modes of an autonomous responder (modifying evidence, and confidently reporting things that are not true) structurally impossible rather than merely discouraged.

Submission compliance (read this first)

Every required component, with its exact location. This project is complete and each item is linked below.

| # | Required component | Where it is |
| --- | --- | --- |
| 1 | Public code repository, open source | This repository. License below. |
| 2 | MIT or Apache 2.0 license file | `LICENSE` (MIT), standalone at repo root, detectable in the About section. |
| 3 | README with setup instructions | This file. See Setup (two paths) and Run an investigation (for judges). |
| 4 | Live deployment or step-by-step run instructions | Run an investigation: one Docker command, or native on SIFT. |
| 5 | Text description of features and functionality | What it does below, and the Devpost project page. |
| 6 | Demo video (live terminal, audio narration, a self-correction) | Linked at the top of the Devpost page. A genuine asciinema terminal recording (the raw cast is committed at `logs/examples/demo_session.cast`, replayable with `asciinema play`, proving it is a real terminal capture). Script in `docs/demo_script.md`. |
| 7 | Architecture diagram | `docs/architecture.png` (and source `docs/architecture.svg`), with the narrative in `docs/architecture.md`. All five element types and both trust boundaries marked. |
| 8 | Evidence dataset documentation | `docs/datasets.md`: sources, hashes, and what the agent found. |
| 9 | Accuracy report | `docs/accuracy_report.md`, plus the per-run auto-generated `accuracy_report.md` in each run directory. |
| 10 | Agent execution logs (traceable to tool executions) | `logs/examples/`: the agent stream (`agent_stream.jsonl`) and the tool provenance (`provenance.jsonl`) for a full run. |

Architectural pattern used (per the brief): Pattern 2, Custom MCP Server, with Claude Code as the agentic engine. The three required capabilities (self-correction without human intervention, accuracy validation traceable to specific artifacts, and a structured investigative narrative) are demonstrated and described in How the design maps to the judging criteria.

A full self-audit against the official Stage One qualification prompt (the exact 12 checks judges run, from the Judge Pack Appendix A) is in SUBMISSION_CHECKLIST.md. The judges' non-negotiable "trace any finding to its tool execution" check is done for you, with three worked examples, in docs/three_claim_trace.md.

What it does

You point VERDICT at a case (a disk image, a memory capture, a packet capture, or several of them from the same host) and it runs a complete triage the way a senior analyst would:

Orients on the evidence, recording a read-only chain-of-custody hash of every object before touching it.
Forms falsifiable hypotheses about what happened, and names the artifact that would confirm or kill each one.
Sequences tools adaptively. It runs the cheapest tool that can decide a hypothesis, reads the result, and lets that result choose the next tool. There is no fixed pipeline.
Corroborates every finding. Before any claim is allowed into the report, a deterministic engine re-reads the archived tool output and checks that the asserted value is really there. A claim is called confirmed only when the value appears in two independent sources.
Self-corrects. When the cited output does not support a claim, the claim is caught as a likely hallucination and retracted. When two sources disagree, the agent runs a third to break the tie. Every change of mind is logged with the execution that triggered it.
Cross-checks sources. Given a disk and a memory image from one host, it compares them and flags discrepancies, which is where real intrusions hide.
Proves integrity and reports. It re-hashes the originals to prove they were never modified, then renders a structured investigative narrative and an honest accuracy report, both generated from the run ledger.
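
The corroboration step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's `corroborate.py`: the `Claim` shape, the `archive` mapping, and the `SINGLE_SOURCE` verdict name are hypothetical, and plain substring matching stands in for whatever matching the real engine uses. It does not model the CONTRADICTED case, which needs a notion of two sources asserting different values.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "CONFIRMED"          # value found in two or more independent sources
    SINGLE_SOURCE = "SINGLE_SOURCE"  # hypothetical name: found in only one source
    UNSUPPORTED = "UNSUPPORTED"      # value absent from every cited output


@dataclass(frozen=True)
class Claim:
    value: str        # the literal value the agent asserts
    exec_ids: tuple   # ids of the tool executions cited as evidence


def corroborate(claim, archive):
    """Re-check a claim against archived raw tool output.

    `archive` maps exec id -> (source name, raw output text). Both the
    mapping and the source names are assumptions made for this sketch.
    """
    supporting_sources = set()
    for exec_id in claim.exec_ids:
        entry = archive.get(exec_id)
        if entry is None:
            continue  # a citation with no archived output supports nothing
        source, raw_output = entry
        if claim.value in raw_output:
            supporting_sources.add(source)
    if len(supporting_sources) >= 2:
        return Verdict.CONFIRMED
    if len(supporting_sources) == 1:
        return Verdict.SINGLE_SOURCE
    return Verdict.UNSUPPORTED
```

The check never consults the model: it only reads text that a tool already produced, so a value the agent invented cannot pass it. Substring matching is deliberately naive here (a short value can match inside a longer one), so a real verifier would want token-aware comparison.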

The result is an investigation a colleague could defend under cross-examination: every sentence traces to a specific tool execution, confirmed facts are separated from inferences, and the mistakes the agent caught itself making are on the record.

Why it is different from the baseline

The baseline Protocol SIFT is a Claude Code configuration: skill files plus behavioral rules that hand the model a shell (Bash(*)) and ask it to be careful. Its guardrails are prompt-based, and it feeds raw tool output into the model context, which is the documented source of its hallucinations.

VERDICT changes the architecture, not just the prompt:

| Concern | Baseline Protocol SIFT | VERDICT |
| --- | --- | --- |
| Evidence safety | Prompt says "never modify"; shell can still do anything | No shell, no write tool exists; files opened `O_RDONLY`; writes forced outside the evidence root |
| Hallucination control | "No hallucinations" instruction | Deterministic corroboration engine re-checks every claim against archived output |
| Self-correction | "On failure, retry" instruction | Structural: an UNSUPPORTED or CONTRADICTED verdict forces a retraction or tie-break, logged with its trigger |
| Context overload | Raw tool dumps into the model | Output parsed to compact summaries; raw archived to disk and referenced by id |
| Audit trail | A summary line appended on stop | One structured provenance record per execution; every finding prints its exec ids |
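
To make the evidence-safety row concrete, here is a minimal sketch of read-only access, integrity hashing, and output-path confinement. It assumes a POSIX host, and the function names (`open_evidence`, `sha256_of`, `resolve_output_path`) are hypothetical; the project's `evidence.py` may be organized quite differently.

```python
import hashlib
import os


def open_evidence(path):
    """Open an evidence file read-only, refusing to follow a final symlink."""
    # O_NOFOLLOW is not available on every platform, so fall back to 0.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    return os.open(path, flags)


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 using the read-only descriptor."""
    digest = hashlib.sha256()
    with os.fdopen(open_evidence(path), "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_output_path(evidence_root, work_root, name):
    """Return a path under work_root; reject anything that escapes it
    or that lands inside the evidence root."""
    evidence_root = os.path.realpath(evidence_root)
    work_root = os.path.realpath(work_root)
    candidate = os.path.realpath(os.path.join(work_root, name))
    if os.path.commonpath([candidate, work_root]) != work_root:
        raise PermissionError("output path escapes the work directory")
    if os.path.commonpath([candidate, evidence_root]) == evidence_root:
        raise PermissionError("output path is inside the evidence root")
    return candidate
```

Hashing once before the investigation and again after it, then comparing the two digests, is one way to show the originals were not modified. Because the descriptor is opened without write access, a write through it fails at the operating-system level regardless of what the caller intended.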

Setup (two paths)

Path A: Docker (recommended for judges, one command)

Requirements: Docker, and Claude Code credentials (an ANTHROPIC_API_KEY, or a mounted ~/.claude). Nothing else.

git clone https://github.com/tejcodes-rex/verdict.git
cd verdict
docker build -t verdict:latest .

The image carries a pinned subset of the SIFT toolchain (The Sleuth Kit, Volatility 3, Volatility 2.6 for older or 32-bit memory images, tshark, YARA, RegRipper, ExifTool) plus the agent runtime, so a run is reproducible on any machine without standing up a SIFT VM.

Path B: Native on the SANS SIFT Workstation

On a SIFT Workstation the tools are already installed. Install the server and the agent config:

git clone https://github.com/tejcodes-rex/verdict.git
cd verdict
pip3 install -e .
# Point Claude Code at the VERDICT MCP server and doctrine:
cp agent/CLAUDE.md   ~/.claude/CLAUDE.md
cp agent/settings.json ~/.claude/settings.json

The MCP server resolves each tool at runtime (Volatility 3 at SIFT's /opt/volatility3*/vol.py, Volatility 2 at SIFT's /usr/local/bin/vol.py, the rest on PATH), with VERDICT_<TOOL> env overrides. Confirm your environment in one command:

python3 -m verdict.doctor      # prints exactly which tool resolved, and where

This was verified against a real SIFT tool layout: with vol deliberately not on PATH (as on the OVA), VERDICT resolved Volatility 3 to python3 /opt/volatility3/vol.py, executed it from that path, and ran a live investigation on the SIFT toolchain. See docs/architecture.md and the accuracy report.
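
The resolution order described above (an environment override, then a known SIFT location, then PATH) can be sketched as follows. This is an illustration under assumptions: `resolve_tool` is a hypothetical name, the override is assumed to be spelled `VERDICT_` plus the upper-cased tool name, and `.py` targets are assumed to run through `python3`, as in the verified `python3 /opt/volatility3/vol.py` case.

```python
import glob
import os
import shutil


def _as_argv(path):
    # Python scripts are launched through the interpreter; binaries directly.
    return ["python3", path] if path.endswith(".py") else [path]


def resolve_tool(name, known_paths=(), env=None):
    """Resolve a tool to an argv prefix, or None if it cannot be found.

    Order: VERDICT_<TOOL> override, then known install locations
    (glob patterns), then a PATH lookup.
    """
    env = os.environ if env is None else env
    override = env.get("VERDICT_" + name.upper())
    if override:
        return _as_argv(override)
    for pattern in known_paths:
        for match in sorted(glob.glob(pattern)):
            if os.path.isfile(match):
                return _as_argv(match)
    found = shutil.which(name, path=env.get("PATH"))
    return [found] if found else None
```

A call such as `resolve_tool("vol", ["/opt/volatility3*/vol.py"])` would cover the case where `vol` is not on PATH. The sketch trusts an override without checking that it exists; a diagnostic command would likely want to report that separately.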

Run an investigation (for judges)

Sample evidence with published ground truth is documented in docs/datasets.md. The Nitroba network case is small (53 MB) and self-validating. To reproduce the headline run:

# 1. Fetch the sample evidence (script downloads from the official source and verifies hashes)
bash scripts/fetch_sample_evidence.sh

# 2. Run the full autonomous investigation in Docker.
#    Evidence is mounted read-only; all output lands in ./work.
ANTHROPIC_API_KEY=sk-... \
EVIDENCE=$(pwd)/evidence/cases/nitroba \
WORK=$(pwd)/work \
docker compose run --rm --entrypoint bash verdict \
  scripts/run_investigation.sh nitroba /evidence /work

(The image entrypoint is the MCP server itself; --entrypoint bash runs the investigation driver instead. If you prefer not to use compose, the equivalent docker run is in docs/submission_guide.md.)

Agent authentication: the agent is Claude Code, so it needs your Claude credentials. Either export ANTHROPIC_API_KEY (shown above), or, if you use a Claude subscription, mount your existing login by adding -v $HOME/.claude:/root/.claude to the run. The image never contains credentials; you supply them at run time. This is the only external dependency, and it is one a judge of this event already has.

When it finishes, look in the newest ./work/run-*/ directory:

report.md is the investigative narrative.
accuracy_report.md is the self-assessment, scored against ground truth.
provenance.jsonl is the tool-execution audit trail.
agent_stream.jsonl is the full agent execution log.

To run the agent against your own evidence, drop it in a directory and pass that as EVIDENCE. The agent adapts to whatever data types it finds.

How the design maps to the judging criteria

Autonomous Execution Quality. The agent reasons over an explicit hypothesis ledger and re-sequences based on what it learns. Self-correction is structural: a corroboration verdict, computed from real tool output, forces it. The triggers are genuine (a value missing from output, two sources disagreeing), so they cannot be staged. See docs/architecture.md.
IR Accuracy. No claim reaches the confirmed list without its asserted value being present in two independent sources. Confirmed facts and inferences are labeled distinctly. The accuracy report is generated from the ledger, lists the hallucinations the agent caught, and is scored against published ground truth.
Breadth and Depth. Disk (The Sleuth Kit), memory (Volatility 3, Volatility 2, and symbol-free string analysis), network (tshark), Windows registry and program-execution evidence (RegRipper, Shimcache, Amcache), and IOC hunting (YARA) are handled deeply, with a grounded MITRE ATT&CK mapping. The Ali Hadi case, run on the genuine SIFT Workstation (logs/examples/alihadi_sift/), correlates a 3 GB disk image and a 1 GB memory capture from one host. It reconstructs the full kill chain: DVWA command injection, an sqlmap SQL-injection campaign, three webshells including a Meterpreter payload, two attacker-created accounts in the on-disk SAM, and RDP persistence. The RDP-persistence finding is promoted to CONFIRMED only when the memory capture independently corroborates the same injected command. Cross-source corroboration of persistence is treated as a first-class, high-value finding.
Constraint Implementation. Guardrails are architectural: the server exposes no destructive primitive, evidence is opened read-only (O_RDONLY | O_NOFOLLOW), the memory tools refuse any Volatility dump plugin or output flag so a tool argument cannot become a write primitive, and a configuration layer denies the generic shell and write tools as defense in depth. The bypass test is documented in the accuracy report.
Audit Trail Quality. One provenance record per execution; every finding prints its exec ids; the three-claim trace is mechanical. The provenance log is a tamper-evident hash chain, and each run is sealed with a verified chain head in manifest.json.
Usability and Documentation. A run takes one Docker command. The stack was also verified on the genuine SANS SIFT Workstation: the official OVA was booted and the full autonomous agent ran on it end to end (logs/sift_verification/, and the SIFT run under logs/examples/nitroba_sift/). python3 -m verdict.doctor reports tool resolution on any host. The repository ships ground-truth samples, committed runs, a live investigation view (scripts/agent_live.py), and an honest head-to-head with the example submission (benchmark/VALHUNTIR_COMPARISON.md).
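
The tamper-evident hash chain mentioned under Audit Trail Quality can be illustrated with a short sketch. The entry layout (`prev`, `record`, `hash`), the all-zero genesis value, and the function names are assumptions made for this example; the actual `provenance.jsonl` and `manifest.json` formats are not reproduced here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def record_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain, record):
    """Append a record and return the new chain head."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "prev": prev_hash,
        "record": record,
        "hash": record_hash(prev_hash, record),
    }
    chain.append(entry)
    return entry["hash"]


def verify_chain(chain, expected_head):
    """Recompute every link and compare the final hash to the sealed head."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if record_hash(prev_hash, entry["record"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return prev_hash == expected_head
```

Editing any earlier record changes its hash, which breaks every later link. Truncating the log leaves the links intact but changes the head, which is why the head is sealed separately from the log itself.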

Every corroborated finding is also mapped to MITRE ATT&CK techniques (deterministically, grounded only in confirmed findings so it cannot hallucinate context), giving the analyst the kill-chain picture without the false-context risk of a knowledge-base lookup.
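
A deterministic mapping of this kind amounts to a lookup table applied only to confirmed findings. The sketch below is illustrative: the finding fields (`id`, `kind`, `status`) and the `kind` labels are hypothetical, while the technique IDs are real MITRE ATT&CK identifiers chosen to echo the finding types named in this README.

```python
# Hypothetical finding kinds mapped to real ATT&CK technique IDs.
ATTACK_MAP = {
    "web_command_injection": ["T1190"],   # Exploit Public-Facing Application
    "webshell": ["T1505.003"],            # Server Software Component: Web Shell
    "local_account_created": ["T1136.001"],  # Create Account: Local Account
    "rdp_enabled": ["T1021.001"],         # Remote Services: Remote Desktop Protocol
}


def map_to_attack(findings):
    """Map confirmed findings to ATT&CK techniques via a fixed table.

    Findings that are not CONFIRMED, or whose kind is not in the table,
    are skipped rather than guessed at.
    """
    mapped = {}
    for finding in findings:
        if finding["status"] != "CONFIRMED":
            continue
        techniques = ATTACK_MAP.get(finding["kind"])
        if techniques:
            mapped[finding["id"]] = list(techniques)
    return mapped
```

Since the table is fixed and the input is restricted to confirmed findings, the same ledger always yields the same mapping, and an inference can never pick up a technique label.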

Repository layout

verdict/
  verdict/            the Python package (MCP server, engine, tools, reports)
    server.py         the MCP server: typed read-only tools + reasoning tools
    evidence.py       read-only evidence vault and integrity guarantees
    provenance.py     per-execution audit records (JSONL)
    corroborate.py    deterministic claim verifier
    ledger.py         hypotheses, findings, self-correction events
    tools/            sleuthkit, volatility3, tshark, plaso wrappers
    report/           narrative and accuracy-report generators
  agent/              the agentic engine config: doctrine + read-only permissions
  scripts/            run an investigation, fetch sample evidence, score a run
  groundtruth/        answer keys for scored cases
  docs/               architecture, datasets, accuracy report, demo script
  logs/examples/      a committed full run (agent stream + tool provenance)
  tests/              smoke tests that exercise the stack on real evidence

License

MIT. See LICENSE. This project builds on the open-source Protocol SIFT and SANS SIFT Workstation; the novel contribution (the read-only MCP server, the corroboration engine, the hypothesis ledger, and the provenance and reporting layers) is original work created for this event and is documented as such throughout.
