FIS Recommender MCP Server

Analyzes DevOps findings to automatically recommend and generate AWS Fault Injection Simulator (FIS) experiment templates. It helps teams validate system resilience by mapping reported issues to specific chaos engineering actions across AWS services.


An MCP (Model Context Protocol) server that recommends AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) experiments based on DevOps Agent findings. It helps teams quickly design chaos engineering experiments to validate system resilience.

Features

  • 🔍 Analyzes DevOps findings and suggests relevant FIS experiments
  • 🎯 Maps issues to appropriate fault injection actions
  • 📋 Generates complete FIS experiment templates
  • ⚡ Integrates seamlessly with Kiro CLI and other MCP clients

Installation

Clone the Repository

git clone https://github.com/pimisael/fis-recommender-mcp.git
cd fis-recommender-mcp
chmod +x server.py

Configure MCP Client

For Kiro CLI

Add to ~/.kiro/mcp-servers.json:

{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

For Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS location):

{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}
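
Both client configs use the same "mcpServers" entry, so adding it can be scripted. The following is a minimal sketch, not part of the repository: merge_server_entry and write_config are hypothetical helper names, and the sketch assumes the config file, if it exists, already contains valid JSON.

```python
import json
from pathlib import Path


def merge_server_entry(config, server_script, region="us-east-1"):
    """Return a copy of an MCP client config with the fis-recommender entry set.

    Hypothetical helper: other servers already present in the config are kept.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["fis-recommender"] = {
        "command": "python3",
        "args": [str(server_script)],
        "env": {"AWS_REGION": region},
    }
    merged["mcpServers"] = servers
    return merged


def write_config(config_path, server_script, region="us-east-1"):
    """Read the config file (if any), merge the entry, and write it back."""
    path = Path(config_path).expanduser()
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_server_entry(existing, server_script, region)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged
```

For Kiro CLI this could be called as write_config("~/.kiro/mcp-servers.json", "/absolute/path/to/fis-recommender-mcp/server.py").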

Usage Examples

Example 1: Network Latency Issue

Prompt:

I have a DevOps finding about network latency causing timeouts in my application. 
Can you recommend FIS experiments to test this?

Finding details:
- ID: finding-001
- Summary: "High network latency between services causing request timeouts"
- Type: NETWORK_ISSUE

Response: The MCP server will recommend:

  • Action: aws:network:disrupt-connectivity
  • Duration: 10 minutes
  • Target: Network interfaces
  • Stop condition: CloudWatch alarm on error rate

Example 2: Database Availability

Prompt:

Recommend FIS experiments for this finding:
{
  "id": "finding-db-001",
  "summary": "Database connection failures during peak load",
  "type": "DATABASE_ISSUE"
}

Response:

  • Action: aws:rds:reboot-db-instances
  • Duration: 2 minutes
  • Target: RDS instances
  • Tests application's database failover handling

Example 3: CPU Stress Testing

Prompt:

We had a CPU spike incident. Generate a FIS template to test our auto-scaling.

Finding: "CPU utilization reached 95% causing service degradation"

Response: Complete FIS experiment template with:

  • EC2 instance stop action
  • 3-minute duration
  • CloudWatch alarm stop condition
  • Target selection by tags

Example 4: Memory Pressure

Prompt:

Create FIS experiments to validate our memory monitoring:
- Finding ID: mem-leak-001
- Issue: Memory leak caused OOM errors
- Need to test alerting and recovery

Response:

  • Action: aws:ssm:send-command (memory stress)
  • Duration: 5 minutes
  • SSM document for memory consumption
  • Tests monitoring and auto-recovery

Standalone Testing

Run the example script to test without an MCP client:

python3 example.py

This will analyze sample findings and display recommendations.

Supported Finding Types

Network & Connectivity

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| network | aws:network:disrupt-connectivity | 5 min | Test network partition handling |
| latency | aws:network:disrupt-connectivity | 10 min | Validate timeout configurations |
| packet loss | aws:ecs:task-network-packet-loss | 5 min | Simulate packet loss scenarios |
| vpc endpoint | aws:network:disrupt-vpc-endpoint | 5 min | Test VPC endpoint failures |
| cross-region | aws:network:route-table-disrupt-cross-region-connectivity | 10 min | Test multi-region connectivity |
| transit gateway | aws:network:transit-gateway-disrupt-cross-region-connectivity | 10 min | Test transit gateway issues |
| direct connect | aws:directconnect:virtual-interface-disconnect | 5 min | Test Direct Connect failures |

Database & Storage

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| database | aws:rds:reboot-db-instances | 2 min | Test database failover |
| rds | aws:rds:failover-db-cluster | 3 min | Test RDS cluster failover |
| dynamodb | aws:dynamodb:global-table-pause-replication | 5 min | Test DynamoDB replication pause |
| aurora dsql | aws:dsql:cluster-connection-failure | 5 min | Test Aurora DSQL failures |
| disk | aws:ebs:pause-volume-io | 3 min | Test disk I/O failures |
| ebs | aws:ebs:volume-io-latency | 5 min | Inject EBS I/O latency |
| s3 replication | aws:s3:bucket-pause-replication | 10 min | Test S3 replication pause |

Compute & Instances

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| cpu | aws:ec2:stop-instances | 3 min | Validate auto-scaling policies |
| memory | aws:ssm:send-command | 5 min | Test OOM handling |
| instance | aws:ec2:reboot-instances | 2 min | Test instance reboot resilience |
| spot | aws:ec2:send-spot-instance-interruptions | 2 min | Test spot interruption handling |
| capacity | aws:ec2:api-insufficient-instance-capacity-error | 5 min | Test capacity error handling |
| auto scaling | aws:ec2:asg-insufficient-instance-capacity-error | 5 min | Test ASG capacity errors |

ECS & Containers

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| ecs | aws:ecs:stop-task | 2 min | Test ECS task failure recovery |
| container cpu | aws:ecs:task-cpu-stress | 5 min | Inject CPU stress on tasks |
| container memory | aws:ecs:task-io-stress | 5 min | Inject I/O stress on tasks |
| container network | aws:ecs:task-network-latency | 5 min | Inject network latency on tasks |
| drain | aws:ecs:drain-container-instances | 5 min | Test container draining |

EKS & Kubernetes

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| eks | aws:eks:pod-delete | 2 min | Test pod deletion recovery |
| pod cpu | aws:eks:pod-cpu-stress | 5 min | Inject CPU stress on pods |
| pod memory | aws:eks:pod-memory-stress | 5 min | Inject memory stress on pods |
| pod network | aws:eks:pod-network-latency | 5 min | Inject network latency on pods |
| nodegroup | aws:eks:terminate-nodegroup-instances | 3 min | Test node termination |
| kubernetes | aws:eks:inject-kubernetes-custom-resource | 5 min | Inject custom K8s faults |

Lambda & Serverless

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| lambda | aws:lambda:invocation-error | 5 min | Inject Lambda errors |
| lambda latency | aws:lambda:invocation-add-delay | 5 min | Add Lambda invocation delay |
| lambda http | aws:lambda:invocation-http-integration-response | 5 min | Test Lambda HTTP failures |

Lambda Chaos Engineering Best Practices

Testing Cold Starts and Timeouts:

  • Use aws:lambda:invocation-add-delay to simulate cold start scenarios
  • Set startupDelayMilliseconds higher than function timeout to test timeout handling
  • Validates retry logic, dead letter queues, and error handling

Error Handling Validation:

  • Use aws:lambda:invocation-error with preventExecution: true to test without running code
  • Set invocationPercentage to gradually increase fault injection (start at 10-20%)
  • Verify CloudWatch alarms fire and monitoring captures errors
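
The gradual ramp described above can be sketched as a list of action definitions with increasing invocationPercentage. This is an illustrative sketch only: lambda_error_action and ramp are hypothetical helper names, the step sizes are arbitrary, and parameter values are written as strings because FIS action parameters are string-valued.

```python
def lambda_error_action(percentage, duration="PT5M", prevent_execution=True):
    """Build the action part of a template for aws:lambda:invocation-error.

    Hypothetical helper; only the action ID and parameter names come from
    the text above.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "actionId": "aws:lambda:invocation-error",
        "parameters": {
            "duration": duration,
            "invocationPercentage": str(percentage),
            "preventExecution": "true" if prevent_execution else "false",
        },
    }


def ramp(start=10, step=20, stop=50):
    """Return one action per fault-injection level, lowest percentage first."""
    return [lambda_error_action(p) for p in range(start, stop + 1, step)]
```

Each entry would be run as its own experiment, moving to the next level only after alarms and dashboards confirm the previous one behaved as expected.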

Integration Testing:

  • Use aws:lambda:invocation-http-integration-response for ALB, API Gateway, VPC Lattice
  • Test upstream/downstream service behavior with custom HTTP status codes
  • Validate circuit breakers and fallback mechanisms

Continuous Testing in CI/CD:

  • Automate Lambda FIS experiments in AWS CodePipeline post-deployment
  • Use CloudWatch Synthetics to monitor user experience during experiments
  • Set stop conditions based on error rate thresholds (e.g., >5% errors)

Experiment Safety:

  • Start experiments in non-production with synthetic traffic
  • Use invocationPercentage parameter to limit blast radius
  • Configure CloudWatch alarms as stop conditions
  • Run during off-peak hours initially

Key Metrics to Monitor:

  • Invocation errors and throttles
  • Duration and billed duration
  • Concurrent executions
  • Dead letter queue messages
  • Downstream service health

Caching & Streaming

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| elasticache | aws:elasticache:replicationgroup-interrupt-az-power | 5 min | Test ElastiCache AZ failure |
| memorydb | aws:memorydb:multi-region-cluster-pause-replication | 5 min | Test MemoryDB replication |
| kinesis | aws:kinesis:stream-provisioned-throughput-exception | 5 min | Test Kinesis throughput |
| kinesis iterator | aws:kinesis:stream-expired-iterator-exception | 3 min | Test expired iterator handling |

API & Throttling

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| api throttle | aws:fis:inject-api-throttle-error | 5 min | Inject API throttling |
| api error | aws:fis:inject-api-internal-error | 5 min | Inject API internal errors |
| api unavailable | aws:fis:inject-api-unavailable-error | 5 min | Inject API unavailable errors |

Availability & Recovery

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| availability | aws:ec2:stop-instances | 5 min | Test high availability setup |
| zonal | aws:arc:start-zonal-autoshift | 10 min | Test zonal autoshift |
| alarm | aws:cloudwatch:assert-alarm-state | 1 min | Validate alarm states |

Available Tools

1. recommend_fis_experiments

Analyzes DevOps Agent findings and returns FIS experiment recommendations.

Input:

{
  "finding": {
    "id": "finding-123",
    "summary": "Network latency caused timeouts",
    "type": "AVAILABILITY_ISSUE"
  }
}

Output:

{
  "recommendations": [
    {
      "action": "aws:network:disrupt-connectivity",
      "duration": "PT10M",
      "description": "Simulates network disruption to test timeout handling",
      "targets": ["NetworkInterface"],
      "stopConditions": ["CloudWatch alarm on error rate > 5%"]
    }
  ],
  "finding_id": "finding-123",
  "count": 1
}
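
One plausible way to get from the input to the output above is keyword matching on the finding summary. The sketch below is an assumption about the approach, not the code in server.py: FINDING_MAPPINGS holds a small hand-picked subset of the keyword tables, recommend is a hypothetical function name, and the output omits the targets and stopConditions fields.

```python
# Hypothetical subset of the keyword tables; the real mapping may differ.
FINDING_MAPPINGS = {
    "latency": {
        "action": "aws:network:disrupt-connectivity",
        "duration": "PT10M",
        "description": "Validate timeout configurations",
    },
    "network": {
        "action": "aws:network:disrupt-connectivity",
        "duration": "PT5M",
        "description": "Test network partition handling",
    },
    "database": {
        "action": "aws:rds:reboot-db-instances",
        "duration": "PT2M",
        "description": "Test database failover",
    },
}


def recommend(finding):
    """Match keywords against the finding summary, case-insensitively."""
    summary = finding.get("summary", "").lower()
    seen_actions = set()
    recommendations = []
    for keyword, rec in FINDING_MAPPINGS.items():
        # Keep only the first match per action so one finding does not
        # yield the same experiment twice.
        if keyword in summary and rec["action"] not in seen_actions:
            seen_actions.add(rec["action"])
            recommendations.append(dict(rec))
    return {
        "recommendations": recommendations,
        "finding_id": finding.get("id"),
        "count": len(recommendations),
    }
```

With the sample input, both "latency" and "network" match the summary, but they map to the same action, so a single recommendation with duration PT10M is returned.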

2. create_fis_template

Generates a complete, ready-to-deploy FIS experiment template.

Input:

{
  "recommendation": {
    "action": "aws:ec2:stop-instances",
    "duration": "PT3M",
    "description": "Test instance failure recovery"
  },
  "target_config": {
    "resourceType": "aws:ec2:instance",
    "selectionMode": "COUNT(1)",
    "tags": {
      "Environment": "staging",
      "Team": "platform"
    },
    "roleArn": "arn:aws:iam::123456789012:role/FISRole"
  }
}

Output: Complete CloudFormation-compatible FIS experiment template ready for deployment.
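
To illustrate how the two inputs combine, here is a minimal sketch that assembles a template in the shape of the FIS CreateExperimentTemplate request (lower camel case keys). It is not the server's implementation and its output format may differ from what the tool returns. build_template, ACTION_SHAPES, and the target name "experiment-targets" are hypothetical, and only aws:ec2:stop-instances is covered because the target key and duration parameter vary by action.

```python
# Per-action details this sketch knows about: (target key, duration parameter).
ACTION_SHAPES = {
    "aws:ec2:stop-instances": ("Instances", "startInstancesAfterDuration"),
}


def build_template(recommendation, target_config, alarm_arn=None):
    """Combine a recommendation and a target config into one template dict."""
    action_id = recommendation["action"]
    if action_id not in ACTION_SHAPES:
        raise ValueError(f"no template shape defined for {action_id}")
    target_key, duration_param = ACTION_SHAPES[action_id]
    target_name = "experiment-targets"  # hypothetical name for the target group

    # Without an alarm ARN the experiment has no automatic stop condition.
    if alarm_arn:
        stop_condition = {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
    else:
        stop_condition = {"source": "none"}

    return {
        "description": recommendation["description"],
        "roleArn": target_config["roleArn"],
        "targets": {
            target_name: {
                "resourceType": target_config["resourceType"],
                "resourceTags": target_config.get("tags", {}),
                "selectionMode": target_config.get("selectionMode", "ALL"),
            }
        },
        "actions": {
            action_id.split(":")[-1]: {
                "actionId": action_id,
                "parameters": {duration_param: recommendation["duration"]},
                "targets": {target_key: target_name},
            }
        },
        "stopConditions": [stop_condition],
    }
```

Passing an alarm ARN is the safer default for anything beyond a dry run, since it lets FIS halt the experiment when the alarm fires.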

Customization

Adding New Finding Mappings

Edit server.py and add to the finding_mappings dictionary:

finding_mappings = {
    "disk": {
        "action": "aws:ebs:pause-volume-io",
        "duration": "PT5M",
        "description": "Simulates disk I/O issues"
    },
    # Add your custom mappings here
}

Adjusting Durations

Modify duration values in ISO 8601 format:

  • PT2M = 2 minutes
  • PT5M = 5 minutes
  • PT10M = 10 minutes
  • PT1H = 1 hour
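
When editing many durations, a pair of small converters can help avoid format mistakes. This is a sketch with hypothetical helper names, and it handles only the hour and minute components used in the examples above, not the full ISO 8601 duration grammar.

```python
import re

# Matches durations made of optional hours and optional minutes, e.g. PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def minutes_to_iso8601(minutes):
    """Convert a positive whole number of minutes to an ISO 8601 duration."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    hours, mins = divmod(minutes, 60)
    result = "PT"
    if hours:
        result += f"{hours}H"
    if mins:
        result += f"{mins}M"
    return result


def iso8601_to_minutes(value):
    """Parse a PT..H..M duration back into minutes."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, mins = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + mins
```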

Requirements

  • Python 3.7+
  • AWS credentials configured (for actual FIS deployment)
  • MCP-compatible client (Kiro CLI, Claude Desktop, etc.)

Chaos Engineering Best Practices

The Chaos Engineering Flywheel

Follow the scientific method for each experiment:

  1. Define Steady State - Establish measurable baseline metrics (TPS, latency, error rate)
  2. Form Hypothesis - Predict how the system will respond to the fault
  3. Run Experiment - Inject the fault in a controlled manner
  4. Verify Results - Compare actual behavior against hypothesis
  5. Improve - Address gaps and re-run experiments
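
Steps 1, 2, and 4 can be made concrete by writing the steady state down as thresholds and comparing observed metrics against them after the run. The sketch below is illustrative: SteadyState, verify_hypothesis, the metric names, and the threshold values are all assumptions, and the observed numbers would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Hypothesis: during the fault, metrics stay within these limits."""
    max_error_rate: float        # fraction of failed requests, e.g. 0.01
    max_p99_latency_ms: float    # 99th percentile latency in milliseconds


def verify_hypothesis(steady, observed):
    """Return a list of gaps; an empty list means the hypothesis held."""
    gaps = []
    if observed["error_rate"] > steady.max_error_rate:
        gaps.append(
            f"error rate {observed['error_rate']:.3f} exceeds {steady.max_error_rate:.3f}"
        )
    if observed["p99_latency_ms"] > steady.max_p99_latency_ms:
        gaps.append(
            f"p99 latency {observed['p99_latency_ms']:.0f} ms exceeds "
            f"{steady.max_p99_latency_ms:.0f} ms"
        )
    return gaps
```

Each reported gap becomes an input to step 5: fix the weakness, then re-run the same experiment against the same thresholds.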

Experiment Safety Guidelines

Start Small, Scale Gradually:

  • Begin in non-production environments
  • Use synthetic traffic before real customer traffic
  • Start with low percentages (10-20%) and increase gradually
  • Run during off-peak hours initially

Implement Guardrails:

  • Set CloudWatch alarms as stop conditions
  • Define clear rollback procedures
  • Monitor blast radius with real-time dashboards
  • Communicate with operations teams before experiments

Scope and Impact:

  • Clearly define experiment boundaries
  • Use tags to target specific resources
  • Limit concurrent experiments
  • Document expected vs. actual impact

Continuous Chaos Testing

Automate in CI/CD:

  • Integrate FIS experiments into AWS CodePipeline
  • Run experiments post-deployment automatically
  • Use results to gate production releases
  • Track experiment results over time

Game Days:

  • Schedule regular chaos engineering sessions
  • Simulate realistic failure scenarios
  • Test incident response procedures
  • Validate runbooks and documentation

Key Metrics to Track

System Health:

  • Request success rate (target: >99.9%)
  • Latency percentiles (p50, p95, p99)
  • Error rates (4xx, 5xx)
  • Resource utilization (CPU, memory, connections)
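
The request-level metrics above can be computed from raw samples collected during an experiment. This sketch assumes each sample is a (status_code, latency_ms) pair and uses the nearest-rank method for percentiles; percentile and summarize are hypothetical helper names, and monitoring tools may use a different percentile definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Summarize (status_code, latency_ms) samples into health metrics."""
    latencies = [latency for _, latency in samples]
    successes = sum(1 for status, _ in samples if status < 400)
    return {
        "success_rate": successes / len(samples),
        "errors_4xx": sum(1 for status, _ in samples if 400 <= status < 500),
        "errors_5xx": sum(1 for status, _ in samples if status >= 500),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }
```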

Resilience Indicators:

  • Time to detect failures
  • Time to recovery
  • Blast radius of failures
  • Cascading failure prevention

Common Failure Scenarios

Network Failures:

  • Partition tolerance between services
  • Cross-region connectivity loss
  • DNS resolution failures
  • Increased latency and packet loss

Resource Exhaustion:

  • CPU and memory pressure
  • Connection pool exhaustion
  • Disk I/O saturation
  • API throttling and rate limits

Dependency Failures:

  • Database failover and replication lag
  • Cache invalidation and cold starts
  • Third-party API unavailability
  • Message queue backlogs

License

MIT

Contributing

Issues and pull requests welcome at https://github.com/pimisael/fis-recommender-mcp
