AI-Driven Remediation Testing

Orchestrates end-to-end testing of AI-powered incident remediation workflows through declarative YAML scenarios, fault injection, AI response evaluation, and automated command execution with comprehensive reporting.

MCP Server - AI-Driven Remediation Testing

A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.

Overview

MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.

Features

  • Declarative Scenarios: Define test scenarios in YAML with variable substitution
  • FSM-Based Orchestration: 13-state finite state machine for reliable execution
  • Fault Injection: Hooks for chaos engineering tools (Chaos Mesh, Litmus, etc.); a stub implementation is included
  • AI Evaluation: Score AI responses using regex, JSON Schema, and semantic similarity
  • Secure Execution: Sandboxed command execution with deny patterns
  • Remediation API Integration: Full HTTP/WebSocket client for workflow APIs
  • Comprehensive Logging: DEBUG+ file logs, INFO+ console, artifact management
  • Production-Ready: Type-safe Python 3.11+ with pydantic validation

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     MCP Server (gRPC)                        │
├─────────────────────────────────────────────────────────────┤
│  ScenarioService │ FaultService │ ExecutorService │ EvalService│
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│ Orchestration    │  │ Fault        │  │ Command          │
│ Engine (FSM)     │  │ Injection    │  │ Executor         │
└──────────────────┘  └──────────────┘  └──────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│              Remediation Workflow API Client                 │
│         (HTTP + WebSocket, InitiateEnsemble, Resume)        │
└─────────────────────────────────────────────────────────────┘

Installation

# Install dependencies
pip install -r requirements.txt

# Generate gRPC code (optional; the MVP uses a simplified implementation)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto

Configuration

Configuration can be provided via config.yaml or environment variables:

# config.yaml
log_dir: "./log"
session_timeout_sec: 300

ws:
  ping_interval: 300
  ping_timeout: 300

grpc:
  host: "localhost"
  port: 50051
  timeout: 300

http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"

Environment variables (override config.yaml):

export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
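
The double underscore in names such as MCP_GRPC__HOST suggests a nested-key convention: an MCP_ prefix, with __ separating the section from the key. A minimal sketch of how such overrides could be layered over values loaded from config.yaml, assuming that convention holds. apply_env_overrides is a hypothetical helper for illustration, not the server's actual settings loader:

```python
import copy
import os
from typing import Any, Mapping


def apply_env_overrides(
    config: dict[str, Any],
    environ: Mapping[str, str] = os.environ,
    prefix: str = "MCP_",
    delimiter: str = "__",
) -> dict[str, Any]:
    """Return a copy of config with prefixed environment values layered on top."""
    merged = copy.deepcopy(config)
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # MCP_GRPC__HOST -> ["grpc", "host"]; MCP_LOG_DIR -> ["log_dir"]
        path = name[len(prefix):].lower().split(delimiter)
        node = merged
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                child = {}
                node[key] = child
            node = child
        # Environment values arrive as strings; type coercion is left to the caller.
        node[path[-1]] = value
    return merged


file_config = {"log_dir": "./log", "grpc": {"host": "localhost", "port": 50051}}
env = {"MCP_GRPC__HOST": "0.0.0.0", "MCP_LOG_DIR": "/var/log/mcp"}
effective = apply_env_overrides(file_config, env)
print(effective["grpc"]["host"], effective["log_dir"])
```

Keys that are not overridden (grpc.port here) keep the value from the file.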

Scenario Definition

Scenarios are defined in YAML with the following structure:

meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
    policies:
      deny_patterns:
        - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
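
The ${namespace} and ${service} placeholders above are filled from bindings. A minimal sketch of that substitution over an already-parsed scenario, assuming placeholders are simple ${name} tokens and that an unbound name should be an error. resolve_bindings is a hypothetical name; the project's own logic lives in utils/variables.py and may differ:

```python
import re
from typing import Any

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_bindings(node: Any, bindings: dict[str, str]) -> Any:
    """Recursively replace ${name} placeholders in strings, lists, and dicts."""
    if isinstance(node, str):
        def _sub(match: re.Match) -> str:
            name = match.group(1)
            if name not in bindings:
                raise KeyError(f"unbound scenario variable: {name}")
            return str(bindings[name])
        return _PLACEHOLDER.sub(_sub, node)
    if isinstance(node, list):
        return [resolve_bindings(item, bindings) for item in node]
    if isinstance(node, dict):
        return {key: resolve_bindings(value, bindings) for key, value in node.items()}
    return node  # numbers, booleans, None pass through unchanged


scenario = {
    "bindings": {"namespace": "production", "service": "api-gateway"},
    "fault": {"type": "pod_kill", "params": {"namespace": "${namespace}"}},
    "execute_remedy": {
        "commands": [
            {"cmd": "kubectl", "args": ["rollout", "restart", "deployment/${service}"]}
        ]
    },
}
# Assumption: bindings passed by the caller take precedence over the YAML values.
effective = {**scenario["bindings"], "namespace": "staging"}
resolved = resolve_bindings(scenario, effective)
print(resolved["fault"]["params"]["namespace"])               # staging
print(resolved["execute_remedy"]["commands"][0]["args"][-1])  # deployment/api-gateway
```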

FSM States

The orchestration engine follows this state machine:

  1. INIT: Initialize scenario, resolve bindings
  2. PRECHECK: Run pre-execution checks (SignalFlow)
  3. FAULT_INJECT: Inject fault using FaultService
  4. STABILIZE: Wait for system stabilization
  5. ASSISTANT_RCA: Get RCA from remediation API
  6. EVAL_RCA: Evaluate RCA response
  7. ASSISTANT_REMEDY: Get remediation commands
  8. EVAL_REMEDY: Evaluate remedy response
  9. EXECUTE_REMEDY: Execute commands
  10. VERIFY: Verify system state
  11. PASS: Scenario passed
  12. FAIL: Scenario failed
  13. CLEANUP: Clean up resources
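
The list above names the states but not the transitions between them. A minimal sketch of one plausible transition rule, assuming states 1 to 10 run in order, any failed step jumps to FAIL, and both PASS and FAIL lead to CLEANUP. These transitions are an assumption for illustration; the actual rules are defined in orchestration/fsm.py:

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    INIT = auto()
    PRECHECK = auto()
    FAULT_INJECT = auto()
    STABILIZE = auto()
    ASSISTANT_RCA = auto()
    EVAL_RCA = auto()
    ASSISTANT_REMEDY = auto()
    EVAL_REMEDY = auto()
    EXECUTE_REMEDY = auto()
    VERIFY = auto()
    PASS = auto()
    FAIL = auto()
    CLEANUP = auto()


# Happy path: every working state in order, ending in PASS.
HAPPY_PATH = [s for s in State if s not in (State.FAIL, State.CLEANUP)]


def next_state(current: State, ok: bool) -> Optional[State]:
    """Return the state that follows `current`, or None once CLEANUP has run."""
    if current is State.CLEANUP:
        return None
    if current in (State.PASS, State.FAIL):
        return State.CLEANUP
    if not ok:
        return State.FAIL
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]


def walk(fail_at: Optional[State] = None) -> list[State]:
    """Trace a run, optionally simulating a failure in one state."""
    trace, state = [], State.INIT
    while state is not None:
        trace.append(state)
        state = next_state(state, ok=state is not fail_at)
    return trace


print([s.name for s in walk(fail_at=State.STABILIZE)])
```

Under these assumptions a failure during STABILIZE skips the assistant and execution states entirely but still reaches CLEANUP.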

Usage

Start Server

python -m mcp_server.server

Run Scenario (Programmatic)

import asyncio
from pathlib import Path

from mcp_server.server import MCPServer
from mcp_server.config import get_settings

async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Read the scenario file; read_text() opens and closes the file for us
    scenario_yaml = Path("scenarios/example_scenario.yaml").read_text()

    # Run the scenario with a namespace binding supplied by the caller
    result = await server.scenario_service.run_scenario(
        scenario_yaml=scenario_yaml,
        bindings={"namespace": "staging"}
    )

    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")

asyncio.run(main())

Check Results

Results are stored in log/runs/{run_id}/:

  • scenario.yaml: Original scenario
  • transcript.json: RCA/remedy responses
  • report.json: Final test report
  • cmd_*.txt: Command outputs
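
A minimal sketch for collecting these artifacts from one run directory. It assumes only the file names listed above and that report.json and transcript.json contain valid JSON; their internal fields are not documented here, so the sketch does not inspect them. load_run_artifacts is a hypothetical helper:

```python
import json
from pathlib import Path
from typing import Any


def load_run_artifacts(run_id: str, log_dir: str = "./log") -> dict[str, Any]:
    """Load the report, transcript, and command outputs for a single run."""
    run_dir = Path(log_dir) / "runs" / run_id
    report = json.loads((run_dir / "report.json").read_text(encoding="utf-8"))
    transcript = json.loads((run_dir / "transcript.json").read_text(encoding="utf-8"))
    # Command outputs are plain text, keyed here by file name.
    command_outputs = {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(run_dir.glob("cmd_*.txt"))
    }
    return {
        "report": report,
        "transcript": transcript,
        "command_outputs": command_outputs,
    }
```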

Services

FaultService

Injects and cleans up faults. Only a stub implementation is provided; integrate it with a chaos tool such as:

  • Chaos Mesh (Kubernetes)
  • Litmus (Kubernetes)
  • Gremlin (Cloud)

ExecutorService

Executes commands with sandboxing:

  • Local execution via asyncio.subprocess
  • Deny pattern enforcement
  • Output capture and artifact storage
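
A minimal sketch of the deny-pattern check combined with asyncio subprocess execution, assuming patterns are matched against the space-joined command line with regex search semantics. run_sandboxed and CommandDenied are hypothetical names, and a pattern check on its own is a filter rather than process isolation:

```python
import asyncio
import re
import sys


class CommandDenied(Exception):
    """Raised when a command line matches a deny pattern."""


async def run_sandboxed(
    cmd: str, args: list[str], deny_patterns: list[str]
) -> tuple[int, str]:
    """Run cmd with args unless the joined command line matches a deny pattern."""
    command_line = " ".join([cmd, *args])
    for pattern in deny_patterns:
        if re.search(pattern, command_line):
            raise CommandDenied(f"{command_line!r} matches deny pattern {pattern!r}")
    # create_subprocess_exec passes arguments directly, without a shell.
    proc = await asyncio.create_subprocess_exec(
        cmd,
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    output, _ = await proc.communicate()
    return proc.returncode, output.decode(errors="replace")


# Harmless demo command; a real scenario would run kubectl here.
code, out = asyncio.run(
    run_sandboxed(sys.executable, ["-c", "print('ok')"], [r".*rm -rf.*"])
)
print(code, out.strip())
```

The check runs before any process is spawned, so a denied command never starts.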

EvalService

Evaluates AI responses:

  • Regex guards: Pattern matching
  • JSON Schema: Structure validation
  • Semantic similarity: Token-based Jaccard
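
Two of these checks fit in a few lines of standard-library Python. The sketch below assumes a regex guard passes when the pattern is found anywhere in the response, and that tokens are lowercase alphanumeric runs; both function names are hypothetical. JSON Schema validation is omitted because it needs a third-party validator:

```python
import re


def passes_regex_guard(response: str, pattern: str) -> bool:
    """True if the pattern occurs anywhere in the response."""
    return re.search(pattern, response) is not None


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, as a set (duplicates are ignored)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token intersection divided by size of the token union."""
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(passes_regex_guard("The Root Cause is an OOM kill.", r"(?i)root cause"))  # True
print(jaccard_similarity("pod crashed in production", "production pod crashed"))  # 0.75
```

Token-based Jaccard ignores word order and synonyms, so it measures lexical overlap rather than meaning.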

RemediationClient

HTTP client for the remediation workflow API:

  • initiate_remediation(): Start new workflow
  • resume_remediation(): Resume with input
  • JSON pointer resolution for graph navigation
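
JSON pointer resolution can be sketched as follows, using RFC 6901 syntax (/-separated tokens, with ~1 and ~0 as escapes for / and ~). The graph document in the example is a made-up shape for illustration, not the API's actual response format, and resolve_pointer is a hypothetical name:

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON pointer against parsed JSON data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON pointer: {pointer!r}")
    node = document
    for raw in pointer[1:].split("/"):
        # Unescape ~1 before ~0, as the RFC requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


# Hypothetical workflow graph, for illustration only.
graph = {
    "state": {"threadId": "thread-123"},
    "nodes": [{"id": "node-789", "interruptId": "int-456"}],
}
print(resolve_pointer(graph, "/nodes/0/id"))      # node-789
print(resolve_pointer(graph, "/state/threadId"))  # thread-123
```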

API Reference

ScenarioService

service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}

Remediation API

InitiateEnsemble:

{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}

ResumeEnsemble:

{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
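
The two request bodies above can be assembled with small builder functions. This is a sketch that mirrors the documented JSON field for field; the function names and parameters are hypothetical and are not the signatures of initiate_remediation() or resume_remediation():

```python
import json
from typing import Any


def build_initiate_payload(
    incident_id: str, title: str, summary: str, next_steps: str
) -> dict[str, Any]:
    """Body for starting a new REMEDIATION ensemble."""
    return {
        "apiMethod": "InitiateEnsemble",
        "apiVersion": "1",
        "ensembleName": "REMEDIATION",
        "payload": {
            "incidentId": incident_id,
            "rcaAnalysis": {
                "title": title,
                "summary": summary,
                "nextSteps": next_steps,
            },
        },
    }


def build_resume_payload(
    thread_id: str, interrupt_id: str, node_id: str, user_input: str
) -> dict[str, Any]:
    """Body for resuming an interrupted workflow with user input."""
    return {
        "apiMethod": "ResumeEnsemble",
        "apiVersion": "1",
        "payload": {
            "messageType": "node_input",
            "stateIdentifier": {"threadId": thread_id, "interruptId": interrupt_id},
            "nodeId": node_id,
            "inputProperties": {"input": user_input},
        },
    }


resume = build_resume_payload("thread-123", "int-456", "node-789", "User input text")
print(json.dumps(resume, indent=2))
```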

Logging

  • Console: INFO+ (concise)
  • File: DEBUG+ at log/mcp_server.log (rotating, 10MB, 5 backups)
  • Artifacts: Per-run in log/runs/{run_id}/
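
A minimal sketch of a logging setup with these levels and rotation limits, using only the standard library. It assumes 10MB means 10 * 1024 * 1024 bytes and uses a logger named mcp_server; the format strings are illustrative, and the project's real setup in logging_config.py may differ:

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def configure_logging(log_dir: str = "./log") -> logging.Logger:
    """Attach an INFO+ console handler and a DEBUG+ rotating file handler."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("mcp_server")
    logger.setLevel(logging.DEBUG)  # handlers do the per-destination filtering

    # Remove handlers from earlier calls so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    file_handler = RotatingFileHandler(
        Path(log_dir) / "mcp_server.log",
        maxBytes=10 * 1024 * 1024,
        backupCount=5,
        encoding="utf-8",
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```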

Development

Project Structure

Remidiation-MCP/
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── proto/                   # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py            # Settings
│   ├── logging_config.py    # Logging
│   ├── server.py            # gRPC server
│   ├── models/              # Pydantic models
│   │   └── scenario.py
│   ├── services/            # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/             # API clients
│   │   └── remediation_client.py
│   ├── orchestration/       # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/               # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/               # Test scenarios
│   └── example_scenario.yaml
└── log/                     # Logs and artifacts

Testing

# Start the server
python -m mcp_server.server

# In another terminal, verify logs
tail -f log/mcp_server.log

# Check results
ls -la log/runs/
cat log/runs/run-*/report.json

Production Deployment

Docker

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "-m", "mcp_server.server"]

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        image: mcp-server:latest
        ports:
        - containerPort: 50051
        env:
        - name: MCP_GRPC__HOST
          value: "0.0.0.0"
        - name: MCP_HTTP__BASE_URL
          value: "http://remediation-api:8901"

Contributing

  1. Follow PEP 8 style guidelines
  2. Add type hints to all functions
  3. Write docstrings for public APIs
  4. Update tests for new features

License

MIT License - See LICENSE file for details

Support

For issues and questions:

  • GitHub Issues: https://github.com/your-org/mcp-server/issues
  • Documentation: https://docs.your-org.com/mcp-server

Roadmap

  • [ ] Full gRPC code generation from .proto files
  • [ ] WebSocket streaming for real-time events
  • [ ] Chaos Mesh integration
  • [ ] Prometheus metrics export
  • [ ] OpenTelemetry tracing
  • [ ] Multi-scenario parallel execution
  • [ ] Scenario templates and library
