aegis-dq

aegis-dq

Agentic data quality MCP server — runs structured validation rules against warehouses (DuckDB, BigQuery, Athena, Databricks, Postgres), diagnoses failures with LLM root cause analysis, and proposes SQL remediations. Full audit trail of every AI decision.

Category
Visit Server

README

Aegis DQ

CI PyPI Downloads Python License GitHub Marketplace Open in Colab

The open-source agentic data quality framework. Validate data contracts, diagnose failures with LLM root-cause analysis, and auto-generate SQL remediation — all in a single CI step or Python call.

  • 31 rule types — completeness, uniqueness, validity, referential integrity, statistical, ML anomaly detection
  • 6 warehouse adapters — DuckDB, Postgres/Redshift, BigQuery, Databricks, AWS Athena, Snowflake
  • Pluggable LLMs — Anthropic Claude, OpenAI, Ollama (local), AWS Bedrock
  • Agentic pipeline — plan → parallel validation → LLM diagnose → RCA → SQL remediate → report

GitHub Actions — Quick Start

Add a data quality gate to any workflow in under 2 minutes:

# .github/workflows/data-quality.yml
name: Data Quality

on: [push, pull_request]

jobs:
  data-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate data quality
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml
          db: data/warehouse.duckdb
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}

The step fails the job automatically when any rules fail, blocking broken data from reaching production. Set fail-on-failure: 'false' to report without blocking.

Offline mode (no API key required):

      - name: Validate data quality (offline)
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml
          db: data/warehouse.duckdb
          no-llm: 'true'

Action inputs

Input Default Description
rules-file rules.yaml Path to rules YAML
db :memory: DuckDB file path
warehouse duckdb duckdb · postgres · redshift
pg-dsn PostgreSQL / Redshift connection DSN
no-llm false Skip LLM — free offline validation
llm anthropic anthropic · openai · ollama
llm-model (provider default) Override the default model
fail-on-failure true Fail the step when rules fail
version (latest) Pin a specific aegis-dq version
anthropic-api-key Required when llm: anthropic
openai-api-key Required when llm: openai

Action outputs

Output Description
rules-checked Total rules evaluated
passed Rules that passed
failed Rules that failed
pass-rate Pass rate as a decimal (e.g. "91.67")
report-json Absolute path to the full JSON report

Using outputs in downstream steps:

      - name: Validate data quality
        id: dq
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml

      - name: Post summary
        run: echo "Pass rate: ${{ steps.dq.outputs.pass-rate }}%"

Demo

Aegis DQ Demo

╭──────────────────────────────────────────────────────╮
│ Aegis DQ  —  RetailCo E-commerce Demo                │
│ LLM: amazon.nova-pro-v1:0 via AWS Bedrock            │
╰──────────────────────────────────────────────────────╯

✓ Pipeline complete in 7.1s · 12 rules · $0.0056 LLM cost

╭──────────────── Validation Summary ─────────────────╮
│  Rules checked  │  12                               │
│  Passed         │  1   │  Failed  │  11             │
│  Pass rate      │  8%  │  Cost    │  $0.005576      │
╰─────────────────────────────────────────────────────╯

LLM Diagnoses
  orders_customer_fk  →  Order placed with customer_id=99 that does not exist.
                         Likely cause: customer deleted or test record not cleaned up.

  products_sku_unique →  Duplicate SKU-001 — two products share the same identifier.
                         Likely cause: duplicate import from supplier feed.

Remediation SQL (LLM-generated)
  orders_status_valid          UPDATE orders SET status = 'SHIPPED' WHERE status = 'DISPATCHED';
  products_price_positive      UPDATE products SET price = ABS(price) WHERE price < 0;
  products_stock_non_negative  UPDATE products SET stock_quantity = 0 WHERE stock_quantity < 0;

Why Aegis?

Aegis DQ Great Expectations / Soda Monte Carlo / Anomalo
Open source ✅ Apache 2.0 ❌ Commercial
Agentic LLM diagnosis + RCA ✅ Proprietary
SQL auto-fix proposals
Audit trail (per-decision log) Partial ✅ Proprietary
Pluggable LLM (Anthropic, OpenAI, Bedrock, Ollama)
dbt integration Partial
Portable open rule standard Partial
ML anomaly detection ✅ built-in ✅ Proprietary

Install

pip install aegis-dq
Extra What it adds
aegis-dq[bigquery] BigQuery adapter
aegis-dq[databricks] Databricks adapter
aegis-dq[athena] AWS Athena adapter
aegis-dq[postgres] PostgreSQL / Redshift adapter
aegis-dq[snowflake] Snowflake adapter
aegis-dq[rest] REST API server (FastAPI + uvicorn)
aegis-dq[openai] OpenAI LLM provider
aegis-dq[airflow] Airflow AegisOperator
aegis-dq[mcp] MCP server for Claude Desktop
aegis-dq[ml] scikit-learn anomaly detection

5-minute quickstart

Seed a demo DuckDB database:

import duckdb

con = duckdb.connect("demo.db")
con.execute("""
    CREATE TABLE orders AS
    SELECT i AS order_id, 'placed' AS status, i * 9.99 AS revenue
    FROM range(1, 10001) t(i)
""")
# introduce some bad data
con.execute("UPDATE orders SET order_id = NULL WHERE order_id % 200 = 0")
con.execute("UPDATE orders SET revenue = -5.00 WHERE order_id % 500 = 0")
con.close()

Generate a starter rules file and run:

aegis init

export ANTHROPIC_API_KEY=sk-ant-...
aegis run rules.yaml --db demo.db

Run without an API key (validation only, no LLM diagnosis):

aegis run rules.yaml --db demo.db --no-llm

Pipeline

Every aegis run passes your data through a LangGraph pipeline:

rules (Python / YAML)
    │
    ▼
  plan ──► parallel_table ──► reconcile ──► remediate ──► report
                 │
         ┌──────────────────┐
         │  per table:      │
         │  execute         │
         │  classify        │
         │  diagnose        │  ← concurrent across all tables
         │  rca             │
         └──────────────────┘
  • plan — parse and validate rules, build an execution graph
  • parallel_table — concurrently fans out per table: execute all rules, classify failures by severity, diagnose with LLM, and trace root causes
  • reconcile — compare results against expected thresholds
  • remediate — LLM proposes a targeted SQL fix for each diagnosed failure
  • report — structured JSON + optional Slack notification

Rule types (31 total)

Category Types
Completeness not_null not_empty_string null_percentage_below
Uniqueness unique composite_unique duplicate_percentage_below
Validity sql_expression between min_value_check max_value_check regex_match accepted_values not_accepted_values no_future_dates column_exists
Referential foreign_key conditional_not_null
Statistical mean_between stddev_below column_sum_between
Timeliness freshness date_order
Volume row_count row_count_between custom_sql
Cross-table reconcile_row_count reconcile_column_sum reconcile_key_match
ML / Anomaly zscore_outlier isolation_forest learned_threshold

Example rule:

rules:
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_revenue_non_negative
      severity: critical
      owner: revenue-team
      tags: [revenue, validity]
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: sql_expression
      expression: "revenue >= 0"

Generate rules with the LLM

Instead of writing rules by hand, let Aegis introspect your table schema and generate a draft rules file:

# Schema-aware structural rules (not_null, between, unique, accepted_values...)
aegis generate orders --db warehouse.duckdb --output orders_rules.yaml

Add a --kb document — any plain text or markdown file describing your business logic — and the LLM generates business validation rules alongside structural ones:

aegis generate orders \
  --db warehouse.duckdb \
  --kb docs/orders_policy.md \
  --output orders_rules.yaml

What goes in a KB file? Anything your team knows about the data:

# orders_policy.md
- status must be one of: placed, confirmed, shipped, delivered, cancelled
- amount must be greater than 0; refunds are handled in a separate table
- customer_id must reference a valid customer (no test accounts: id > 1000)
- order_date must not be in the future
- discount_pct must be between 0 and 0.5 (max 50% discount)

The LLM turns these into accepted_values, sql_expression, between, and foreign_key rules automatically. Generated rules are stamped status: draft — review, promote to active, and commit.

All aegis generate options:

Flag Default Description
--db DuckDB file for schema introspection
--kb Business-context file (text/markdown)
--output rules.yaml Output YAML file
--max-rules 20 Cap on number of rules generated
--no-verify false Skip SQL verification of generated rules
--save-versions false Persist rules to version store
--provider anthropic LLM provider
--model (default) Override model

Warehouse adapters

Adapter Install Status
DuckDB built-in ✅ GA
BigQuery aegis-dq[bigquery] ✅ GA
Databricks aegis-dq[databricks] ✅ GA
AWS Athena aegis-dq[athena] ✅ GA
Postgres / Redshift aegis-dq[postgres] ✅ GA
Snowflake aegis-dq[snowflake] ✅ GA

LLM providers

Provider Install Default model
Anthropic (Claude) built-in claude-haiku-4-5
OpenAI aegis-dq[openai] gpt-4o-mini
Ollama (local) aegis-dq[ollama] llama3.2
AWS Bedrock pip install boto3 amazon.nova-pro-v1:0

Switch providers at the CLI:

aegis run rules.yaml --llm openai --llm-model gpt-4o
aegis run rules.yaml --llm ollama --llm-model llama3.2
aegis run rules.yaml --llm bedrock --llm-model amazon.nova-pro-v1:0

Integrations

Integration What it does
GitHub Action CI/CD gate — fails the job when rules fail
aegis-dq[rest] REST API server — aegis serve
aegis-dq[airflow] AegisOperator — drop-in Airflow task
aegis-dq[mcp] MCP server for Claude Desktop / tool use
aegis dbt generate Convert dbt manifest.json to Aegis rules

CLI reference

Command Description
aegis init Generate a starter rules.yaml
aegis validate <config> Check YAML syntax + schema (no warehouse needed)
aegis generate <table> LLM-generate rules from table schema
aegis run <config> Run validation, diagnose failures, produce a report
aegis rules list Browse built-in rule templates
aegis audit trajectory <run-id> Inspect the LLM decision trail for a past run
aegis audit search <query> Full-text search across audit logs
aegis dbt generate <manifest> Convert a dbt manifest to Aegis rules
aegis mcp serve Start the MCP server for Claude Desktop

aegis run flags:

Flag Default Description
--db :memory: DuckDB file path
--llm anthropic LLM provider
--llm-model (provider default) Override model name
--no-llm false Skip LLM diagnosis entirely
--output-json (none) Write full JSON report to file
--notify (none) Slack webhook URL
--notify-on failures When to notify: all · failures · critical

Roadmap

Phase Version Items Status
Foundation v0.1 Core agent, DuckDB, CLI, audit trail ✅ Done
Differentiate v0.5 BigQuery, Databricks, Athena, Airflow, Ollama, RCA, ShareGPT export, FTS5 search, dbt, MCP ✅ Done
Quality v0.7 SQL verification pipeline, rule versioning, aegis generate (LLM + KB), GitHub Action, ML anomaly detection ✅ Done
Mature v1.0 Postgres, REST API, parallel subagents, VS Code extension, eval suite, banking/healthcare packs 🚧 In progress

Full issue tracker: github.com/aegis-dq/aegis-dq/issues


Contributing

Contributions are welcome. See CONTRIBUTING.md to get started.

Good first issues: label:good first issue

License

Apache 2.0

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured