VERDICT
Enables autonomous DFIR investigation by turning the SIFT toolchain into evidence-safe MCP functions, with self-correction, corroboration, and traceable audit trails.
Verifiable Evidence Reasoning for DFIR Investigation, Correlation, and Triage
An autonomous incident response analyst for the SANS SIFT Workstation. VERDICT extends Protocol SIFT with a read-only Custom MCP Server that turns the SIFT toolchain into typed, evidence-safe functions, and a Claude Code agent that forms hypotheses, corroborates every claim against the actual tool output, and self-corrects when sources disagree. It is built to match adversary speed without trading away the one thing a responder cannot lose: the integrity of the evidence and the truth of the findings.
Find Evil is about closing the gap between machine-speed attacks and human-speed response. In November 2025 Anthropic documented GTG-1002, a state-sponsored operation that drove Claude Code through reconnaissance, exploitation, and lateral movement at 80 to 90 percent autonomy, at request rates Anthropic called physically impossible for human operators. That was the offensive side. Protocol SIFT is the defensive answer, and Rob Lee's framing is exact: meet AI threat speed with defensive AI orchestration. VERDICT sharpens that orchestration by making the two failure modes of an autonomous responder (modifying evidence, and confidently reporting things that are not true) structurally impossible rather than merely discouraged.
Submission compliance (read this first)
Every required component, with its exact location. This project is complete and each item is linked below.
| # | Required component | Where it is |
|---|---|---|
| 1 | Public code repository, open source | This repository. License below. |
| 2 | MIT or Apache 2.0 license file | LICENSE (MIT), standalone at repo root, detectable in the About section. |
| 3 | README with setup instructions | This file. See Setup and Run it. |
| 4 | Live deployment or step-by-step run instructions | Run an investigation: one Docker command, or native on SIFT. |
| 5 | Text description of features and functionality | What it does below, and the Devpost project page. |
| 6 | Demo video (live terminal, audio narration, a self-correction) | Linked at the top of the Devpost page. A genuine asciinema terminal recording (the raw cast is committed at logs/examples/demo_session.cast, replayable with asciinema play, proving it is a real terminal capture). Script in docs/demo_script.md. |
| 7 | Architecture diagram | docs/architecture.png (and source docs/architecture.svg), with the narrative in docs/architecture.md. All five element types and both trust boundaries marked. |
| 8 | Evidence dataset documentation | docs/datasets.md: sources, hashes, and what the agent found. |
| 9 | Accuracy report | docs/accuracy_report.md, plus the per-run auto-generated accuracy_report.md in each run directory. |
| 10 | Agent execution logs (traceable to tool executions) | logs/examples/: the agent stream (agent_stream.jsonl) and the tool provenance (provenance.jsonl) for a full run. |
Architectural pattern used (per the brief): Pattern 2, Custom MCP Server, with Claude Code as the agentic engine. The three required capabilities (self-correction without human intervention, accuracy validation traceable to specific artifacts, and a structured investigative narrative) are demonstrated and described in How the design maps to the criteria.
A full self-audit against the official Stage One qualification prompt (the exact 12 checks
judges run, from the Judge Pack Appendix A) is in
SUBMISSION_CHECKLIST.md. The judges' non-negotiable
"trace any finding to its tool execution" check is done for you, with three worked
examples, in docs/three_claim_trace.md.
What it does
You point VERDICT at a case (a disk image, a memory capture, a packet capture, or several of them from the same host) and it runs a complete triage the way a senior analyst would:
- Orients on the evidence, recording a read-only chain-of-custody hash of every object before touching it.
- Forms falsifiable hypotheses about what happened, and names the artifact that would confirm or kill each one.
- Sequences tools adaptively. It runs the cheapest tool that can decide a hypothesis, reads the result, and lets that result choose the next tool. There is no fixed pipeline.
- Corroborates every finding. Before any claim enters the report, a deterministic engine re-reads the archived tool output and checks that the asserted value is really there. Only a value present in two independent sources is marked confirmed.
- Self-corrects. When the cited output does not support a claim, the claim is caught as a likely hallucination and retracted. When two sources disagree, the agent runs a third to break the tie. Every change of mind is logged with the execution that triggered it.
- Cross-checks sources. Given a disk and a memory image from one host, it compares them and flags discrepancies, which is where real intrusions hide.
- Proves integrity and reports. It re-hashes the originals to prove they were never modified, then renders a structured investigative narrative and an honest accuracy report, both generated from the run ledger.
The result is an investigation a colleague could defend under cross-examination: every sentence traces to a specific tool execution, confirmed facts are separated from inferences, and the mistakes the agent caught itself making are on the record.
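The first and last steps above (hash every object before touching it, then re-hash to prove nothing changed) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VERDICT's own evidence.py: the function names `sha256_readonly`, `custody_manifest`, and `modified_objects` are hypothetical, and SHA-256 is an assumed choice of digest.

```python
import hashlib
import os
from pathlib import Path


def sha256_readonly(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file through a descriptor opened strictly read-only."""
    fd = os.open(path, os.O_RDONLY)
    digest = hashlib.sha256()
    with os.fdopen(fd, "rb") as handle:
        # Stream in chunks so multi-gigabyte images never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_manifest(evidence_root: str) -> dict[str, str]:
    """Map every regular file under evidence_root to its SHA-256 digest."""
    root = Path(evidence_root)
    return {
        str(p.relative_to(root)): sha256_readonly(str(p))
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def modified_objects(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return names whose digest changed, appeared, or disappeared."""
    names = sorted(set(before) | set(after))
    return [name for name in names if before.get(name) != after.get(name)]
```

Taking one manifest at orientation and a second after the report is rendered, an empty `modified_objects(before, after)` is the integrity proof; any non-empty result names exactly which object differs.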
Why it is different from the baseline
The baseline Protocol SIFT is a Claude Code configuration: skill files plus behavioral
rules that hand the model a shell (Bash(*)) and ask it to be careful. Its guardrails
are prompt-based, and it feeds raw tool output into the model context, which is the
documented source of its hallucinations.
VERDICT changes the architecture, not just the prompt:
| Concern | Baseline Protocol SIFT | VERDICT |
|---|---|---|
| Evidence safety | Prompt says "never modify"; shell can still do anything | No shell, no write tool exists; files opened O_RDONLY; writes forced outside the evidence root |
| Hallucination control | "No hallucinations" instruction | Deterministic corroboration engine re-checks every claim against archived output |
| Self-correction | "On failure, retry" instruction | Structural: an UNSUPPORTED or CONTRADICTED verdict forces a retraction or tie-break, logged with its trigger |
| Context overload | Raw tool dumps into the model | Output parsed to compact summaries; raw archived to disk and referenced by id |
| Audit trail | A summary line appended on stop | One structured provenance record per execution; every finding prints its exec ids |
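
The corroboration row of the table can be made concrete with a small sketch. It is an illustration only, not the code in corroborate.py: the `Citation` type, the `archive` mapping from execution id to raw output, the plain substring check, and the `INFERENCE` label for single-source claims are all assumptions made here. The CONTRADICTED verdict (two sources disagreeing) is left out to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"      # value present in two or more independent sources
    INFERENCE = "inference"      # hypothetical label: supported by one source only
    UNSUPPORTED = "unsupported"  # a cited output does not contain the value


@dataclass(frozen=True)
class Citation:
    exec_id: str  # id of the archived tool execution being cited
    source: str   # evidence class, for example "disk", "memory", "network"


def corroborate(value: str, citations: list[Citation], archive: dict[str, str]) -> Verdict:
    """Re-read each cited output and decide how far the claim is supported."""
    supporting_sources = set()
    for citation in citations:
        output = archive.get(citation.exec_id, "")
        if value not in output:
            # The agent cited output that does not back the claim:
            # treat it as a likely hallucination.
            return Verdict.UNSUPPORTED
        supporting_sources.add(citation.source)
    if len(supporting_sources) >= 2:
        return Verdict.CONFIRMED
    if supporting_sources:
        return Verdict.INFERENCE
    return Verdict.UNSUPPORTED  # a claim with no citations is never accepted
```

The point of the design is that the check never asks the model whether the claim is true; it only reads archived bytes, so the same inputs always give the same verdict.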
Setup (two paths)
Path A: Docker (recommended for judges, one command)
Requirements: Docker, and Claude Code credentials (an ANTHROPIC_API_KEY, or a mounted
~/.claude). Nothing else.
git clone https://github.com/tejcodes-rex/verdict.git
cd verdict
docker build -t verdict:latest .
The image carries a pinned subset of the SIFT toolchain (The Sleuth Kit, Volatility 3, Volatility 2.6 for older or 32-bit memory images, tshark, YARA, RegRipper, ExifTool) plus the agent runtime, so a run is reproducible on any machine without standing up a SIFT VM.
Path B: Native on the SANS SIFT Workstation
On a SIFT Workstation the tools are already installed. Install the server and the agent config:
git clone https://github.com/tejcodes-rex/verdict.git
cd verdict
pip3 install -e .
# Point Claude Code at the VERDICT MCP server and doctrine:
cp agent/CLAUDE.md ~/.claude/CLAUDE.md
cp agent/settings.json ~/.claude/settings.json
The MCP server resolves each tool at runtime (Volatility 3 at SIFT's /opt/volatility3*/vol.py,
Volatility 2 at SIFT's /usr/local/bin/vol.py, the rest on PATH), with VERDICT_<TOOL>
env overrides. Confirm your environment in one command:
python3 -m verdict.doctor # prints exactly which tool resolved, and where
This was verified against a real SIFT tool layout: with vol deliberately not on PATH (as
on the OVA), VERDICT resolved Volatility 3 to python3 /opt/volatility3/vol.py, executed it
from that path, and ran a live investigation on the SIFT toolchain. See
docs/architecture.md and the accuracy report.
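The resolution order described above (environment override, then a known install location, then PATH) might look roughly like the following. This is a sketch, not the project's resolver: the function name `resolve_tool`, its parameters, and the convention of upper-casing the tool name to build the `VERDICT_<TOOL>` variable are assumptions.

```python
import glob
import os
import shutil


def resolve_tool(name, known_globs=(), environ=None):
    """Return the path for a tool, or None if it cannot be found.

    Order: VERDICT_<TOOL> override, then known install globs, then PATH.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(f"VERDICT_{name.upper()}")  # assumed naming convention
    if override:
        return override
    for pattern in known_globs:
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return shutil.which(name)


if __name__ == "__main__":
    # Example: look for Volatility 3 where the text says SIFT installs it.
    print(resolve_tool("vol", known_globs=["/opt/volatility3*/vol.py"]))
```

Checking the fixed location before PATH is what lets a host with no `vol` on PATH still resolve Volatility 3, which is the situation the paragraph above describes for the OVA.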
Run an investigation (for judges)
Sample evidence with published ground truth is documented in
docs/datasets.md. The Nitroba network case is small (53 MB) and
self-validating. To reproduce the headline run:
# 1. Fetch the sample evidence (script downloads from the official source and verifies hashes)
bash scripts/fetch_sample_evidence.sh
# 2. Run the full autonomous investigation in Docker.
# Evidence is mounted read-only; all output lands in ./work.
ANTHROPIC_API_KEY=sk-... \
EVIDENCE=$(pwd)/evidence/cases/nitroba \
WORK=$(pwd)/work \
docker compose run --rm --entrypoint bash verdict \
scripts/run_investigation.sh nitroba /evidence /work
(The image entrypoint is the MCP server itself; --entrypoint bash runs the
investigation driver instead. If you prefer not to use compose, the equivalent
docker run is in docs/submission_guide.md.)
Agent authentication: the agent is Claude Code, so it needs your Claude credentials.
Either export ANTHROPIC_API_KEY (shown above), or, if you use a Claude subscription,
mount your existing login by adding -v $HOME/.claude:/root/.claude to the run. The
image never contains credentials; you supply them at run time. This is the only external
dependency, and it is one a judge of this event already has.
When it finishes, look in the newest ./work/run-*/ directory:
- report.md is the investigative narrative.
- accuracy_report.md is the self-assessment, scored against ground truth.
- provenance.jsonl is the tool-execution audit trail.
- agent_stream.jsonl is the full agent execution log.
To run the agent against your own evidence, drop it in a directory and pass that as
EVIDENCE. The agent adapts to whatever data types it finds.
How the design maps to the judging criteria
- Autonomous Execution Quality. The agent reasons over an explicit hypothesis ledger and re-sequences based on what it learns. Self-correction is structural: a corroboration verdict, computed from real tool output, forces it. The triggers are genuine (a value missing from output, two sources disagreeing), so they cannot be staged. See docs/architecture.md.
- IR Accuracy. No claim reaches the confirmed list without its asserted value being present in two independent sources. Confirmed facts and inferences are labeled distinctly. The accuracy report is generated from the ledger, lists the hallucinations the agent caught, and is scored against published ground truth.
- Breadth and Depth. Disk (The Sleuth Kit), memory (Volatility 3, Volatility 2, and symbol-free string analysis), network (tshark), Windows registry and program-execution evidence (RegRipper, Shimcache, Amcache), and IOC hunting (YARA) are handled deeply, with a grounded MITRE ATT&CK mapping. The Ali Hadi case, run on the genuine SIFT Workstation (logs/examples/alihadi_sift/), correlates a 3 GB disk image and a 1 GB memory capture from one host: it reconstructs the full kill chain (DVWA command injection, an sqlmap SQL-injection campaign, three webshells including a Meterpreter payload, two attacker-created accounts in the on-disk SAM, and RDP persistence), and promotes the RDP-persistence finding to CONFIRMED only when the memory capture independently corroborates the same injected command. Cross-source corroboration of persistence is treated as a first-class, high-value finding.
- Constraint Implementation. Guardrails are architectural: the server exposes no destructive primitive, evidence is opened read-only (O_RDONLY | O_NOFOLLOW), the memory tools refuse any Volatility dump plugin or output flag so a tool argument cannot become a write primitive, and a configuration layer denies the generic shell and write tools as defense in depth. The bypass test is documented in the accuracy report.
- Audit Trail Quality. One provenance record per execution; every finding prints its exec ids; the three-claim trace is mechanical. The provenance log is a tamper-evident hash chain, and each run is sealed with a verified chain head in manifest.json.
- Usability and Documentation. One Docker command to run, and verified on the genuine SANS SIFT Workstation: the official OVA was booted and the full autonomous agent ran on it end to end (logs/sift_verification/, and the SIFT run under logs/examples/nitroba_sift/); python3 -m verdict.doctor reports tool resolution on any host. Ships ground-truth samples, committed runs, a live investigation view (scripts/agent_live.py), and an honest head-to-head with the example submission (benchmark/VALHUNTIR_COMPARISON.md).
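The two architectural guardrails named under Constraint Implementation can be illustrated as follows. This is a hedged sketch for POSIX hosts, not the server's code: `open_evidence` and `reject_write_arguments` are hypothetical names, and the denied-flag set is an illustrative guess that may not match the list VERDICT actually enforces.

```python
import os

# Illustrative deny list only; an assumption, not VERDICT's actual list.
DENIED_FLAGS = {"-o", "--output-dir", "--dump"}


def open_evidence(path: str) -> int:
    """Open evidence read-only, refusing a symlink as the final path component."""
    # O_RDONLY: the descriptor cannot be written through.
    # O_NOFOLLOW: a symlink planted at `path` makes the open fail instead of
    # silently redirecting the read somewhere else.
    return os.open(path, os.O_RDONLY | os.O_NOFOLLOW)


def reject_write_arguments(plugin: str, args: list[str]) -> None:
    """Raise PermissionError if a memory-tool call could write files."""
    if "dump" in plugin.lower():
        raise PermissionError(f"dump plugin refused: {plugin}")
    for arg in args:
        flag = arg.split("=", 1)[0]  # treat "--output-dir=/x" like "--output-dir"
        if flag in DENIED_FLAGS:
            raise PermissionError(f"output flag refused: {flag}")
```

Because the check runs before any subprocess is started, a tool argument supplied by the model cannot turn a read-only plugin call into a write.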
Every corroborated finding is also mapped to MITRE ATT&CK techniques (deterministically, grounded only in confirmed findings so it cannot hallucinate context), giving the analyst the kill-chain picture without the false-context risk of a knowledge-base lookup.
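The tamper-evident hash chain mentioned under Audit Trail Quality follows a well-known construction: each record's hash covers the previous record's hash, so editing any earlier record changes every hash after it. The sketch below shows that general technique under assumptions; the field names (`record`, `prev`, `hash`), the all-zero genesis value, and the canonical-JSON encoding are illustrative and are not taken from provenance.py or manifest.json.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def record_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the record before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain: list, record: dict) -> str:
    """Append a record and return the new chain head."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"record": record, "prev": prev, "hash": record_hash(prev, record)}
    chain.append(entry)
    return entry["hash"]


def verify_chain(chain: list, expected_head: str) -> bool:
    """Recompute every link and compare the final hash with the sealed head."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != record_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return prev == expected_head
```

Storing only the final head separately from the log is enough to detect edits, insertions, and truncation anywhere in the log, since each of those changes the recomputed head.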
Repository layout
verdict/
    verdict/            the Python package (MCP server, engine, tools, reports)
        server.py       the MCP server: typed read-only tools + reasoning tools
        evidence.py     read-only evidence vault and integrity guarantees
        provenance.py   per-execution audit records (JSONL)
        corroborate.py  deterministic claim verifier
        ledger.py       hypotheses, findings, self-correction events
        tools/          sleuthkit, volatility3, tshark, plaso wrappers
        report/         narrative and accuracy-report generators
    agent/              the agentic engine config: doctrine + read-only permissions
    scripts/            run an investigation, fetch sample evidence, score a run
    groundtruth/        answer keys for scored cases
    docs/               architecture, datasets, accuracy report, demo script
    logs/examples/      a committed full run (agent stream + tool provenance)
    tests/              smoke tests that exercise the stack on real evidence
License
MIT. See LICENSE. This project builds on the open-source Protocol SIFT and
SANS SIFT Workstation; the novel contribution (the read-only MCP server, the
corroboration engine, the hypothesis ledger, and the provenance and reporting layers) is
original work created for this event and is documented as such throughout.