Stateful MCP Server on ECS Fargate

Enables MCP sessions to survive ECS Fargate deployments by using Redis-backed tool state and stateless Streamable HTTP transport.

Stateful MCP Servers on ECS Fargate

This repository is an end-to-end practical test of a production question:

Can a stateful MCP server survive ECS Fargate force deployments?

The final answer from the live test is:

Yes, but Redis-backed tool state alone is not enough. The MCP Streamable HTTP transport session registry must also stop depending on one container's memory. In this implementation we use stateless Streamable HTTP plus Redis-backed logical session state.

This project was based on the experiment from:

https://github.com/AvinashDalvi89/stateful-mcp-on-ecs-fargate-example

We extended it, deployed it on AWS, reproduced the problem, fixed it, force deployed again, captured AWS Console evidence, and then tore the resources down.

What Is MCP?

MCP stands for Model Context Protocol.

It is a protocol that lets an AI client call external tools, read resources, and interact with systems in a structured way. Instead of hard-coding every integration inside the model or client, MCP gives the client a standard way to talk to an MCP server.

In this project:

  • The MCP client is the Python test client in test_client/test_client.py.
  • The MCP server is the FastMCP application running inside ECS Fargate.
  • The MCP tools are Python functions exposed from src/tools.py.
  • The MCP transport is Streamable HTTP at /mcp.
  • The logical session state is the per-client data written and read by the MCP tools.

Simple flow:

MCP client -> HTTP /mcp -> FastMCP server -> MCP tool -> session store

The important point is that MCP can have state at more than one layer. The tool can have state, but the MCP HTTP transport can also have its own session handling. That is exactly where the ECS Fargate deployment problem appeared.
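
To make the flow concrete, here is a minimal sketch of the JSON-RPC 2.0 message an MCP client would POST to /mcp for a tool call. The tools/call method and the params shape follow the MCP specification. The argument names key and value are assumptions for illustration, because the real signature of set_session_value() is not shown in this README.

```python
import json
from itertools import count

_request_ids = count(1)  # each request in a session needs its own JSON-RPC id


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build the JSON-RPC 2.0 envelope for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names ("key", "value") are hypothetical; the real tool may differ.
message = build_tool_call("set_session_value", {"key": "color", "value": "blue"})
body = json.dumps(message).encode("utf-8")

# Streamable HTTP clients accept both plain JSON and SSE responses.
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}
print(json.dumps(message, indent=2))
```

The envelope carries only tool-level information. Any transport session handling happens in HTTP headers around it, which is why the two layers of state can fail independently.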

Stateful And Stateless In This Project

Stateful

Stateful means the server remembers something from earlier requests.

In this project, the original stateful behavior had two kinds of state:

  1. Application/tool state

    • Example: values written by set_session_value().
    • This is the data the MCP tools need across requests.
    • In the original version, this lived in container memory.
  2. MCP transport session state

    • Example: the Streamable HTTP session known by the FastMCP session manager.
    • This is internal transport-level state used before the tool handler runs.
    • In the default stateful Streamable HTTP mode, this also lived in container memory.

That works with a single long-lived container. It becomes risky on ECS Fargate because tasks are ephemeral and are replaced on every deployment.

When ECS replaces a task:

Old task memory disappears
New task has no memory of old transport sessions
Client may be routed to the new task
Default stateful Streamable HTTP can return "Session not found"
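
The failure sequence above can be reproduced with a toy model. This is a simulation sketch only: StatefulTask is a hypothetical stand-in for one Fargate task, not FastMCP's real session manager, and the registry is a plain Python set.

```python
import uuid


class StatefulTask:
    """Toy stand-in for one Fargate task with an in-memory transport registry."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._transport_sessions: set[str] = set()  # lives only in this process

    def initialize(self) -> str:
        """Issue a transport session ID and remember it locally."""
        session_id = uuid.uuid4().hex
        self._transport_sessions.add(session_id)
        return session_id

    def handle(self, session_id: str) -> tuple[bool, str]:
        # The transport check runs before any tool handler is reached.
        if session_id not in self._transport_sessions:
            return False, "Session not found"
        return True, f"handled by {self.name}"


old_task = StatefulTask("task-a")
session_id = old_task.initialize()
print(old_task.handle(session_id))

# ECS replaces the task: the new process starts with an empty registry.
new_task = StatefulTask("task-b")
print(new_task.handle(session_id))
```

The second call fails even though the client did nothing wrong, because the only copy of the registry died with the old task.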

Stateless

Stateless means a single container does not need to remember transport session data between requests.

In our final working version:

  • FastMCP runs with stateless_http=True.
  • The server does not depend on one task's in-memory Streamable HTTP session registry.
  • The client sends a stable logical Mcp-Session-Id.
  • Tool state is stored in external Redis, not inside the Fargate task.

So the container can be replaced, but the session data still exists outside the container.

Old ECS task dies
New ECS task receives request
Client sends same logical Mcp-Session-Id
New task reads logical session state from Redis
Request succeeds
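
The same replacement scenario succeeds once no task owns the session. The sketch below is a simulation under stated assumptions: SharedStore is a dict-backed stand-in for Redis, and StatelessTask is a hypothetical model of a task, not the real server code.

```python
class SharedStore:
    """Stand-in for Redis: a single store that every task can reach."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict[str, str]] = {}

    def write(self, session_id: str, key: str, value: str) -> None:
        self._sessions.setdefault(session_id, {})[key] = value

    def read(self, session_id: str) -> dict[str, str]:
        return dict(self._sessions.get(session_id, {}))


class StatelessTask:
    """Toy Fargate task that keeps no per-session data in its own memory."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store

    def set_value(self, session_id: str, key: str, value: str) -> str:
        self.store.write(session_id, key, value)
        return self.name

    def get_state(self, session_id: str) -> dict[str, str]:
        return self.store.read(session_id)


store = SharedStore()
session_id = "client-example"  # logical ID chosen by the client

task_a = StatelessTask("task-a", store)
task_a.set_value(session_id, "step", "1")

del task_a  # old ECS task dies; nothing session-related was held in it
task_b = StatelessTask("task-b", store)
state_after_replacement = task_b.get_state(session_id)
print(state_after_replacement)
```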

Why Redis Alone Was Not Enough

Redis solved only the application/tool state problem.

The first Redis-backed attempt still failed because the default FastMCP Streamable HTTP transport session registry was still in memory inside each task. When the ALB sent the client to a new task, the new task did not know the old MCP transport session and returned:

Session not found

That happened before our Redis-backed tool code could help.

The fix was to combine both:

Stateless MCP Streamable HTTP
+ external Redis logical session state

The Problem

ECS Fargate tasks are disposable. During a force deployment or rolling deployment:

  1. ECS starts new replacement tasks.
  2. The new tasks register with the ALB target group.
  3. Old tasks are deregistered and enter draining.
  4. ECS sends SIGTERM to old containers.
  5. Old container memory disappears.

If an MCP server keeps session data only in process memory, the session can break when the client is routed to a new task.
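
Step 4 is the last moment the old process can react. The repository lists src/shutdown.py, but its contents are not shown here, so the following is a hypothetical sketch of a SIGTERM handler that marks the process as draining. All names in it are assumptions.

```python
import signal
import threading

# Set once SIGTERM arrives; request handlers can poll it to stop taking new work.
shutting_down = threading.Event()


def handle_sigterm(signum: int, frame: object) -> None:
    """Mark the process as draining. In-memory state is still lost at exit."""
    shutting_down.set()


def install_signal_handlers() -> None:
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def accepting_requests() -> bool:
    return not shutting_down.is_set()
```

A server would call install_signal_handlers() once at startup. A handler like this can only make the exit graceful; it cannot preserve session data, which is why the state has to live outside the task.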

ALB sticky sessions can delay this problem, but they do not solve it. When a target is draining, unhealthy, or removed, the ALB can route the client to a different task.

What We Tested

The original experiment used:

  • FastMCP server on ECS Fargate.
  • Streamable HTTP endpoint at /mcp.
  • ALB with sticky sessions.
  • ECS rolling deployment.
  • In-memory session store.
  • A client that repeatedly calls MCP tools.

We added:

  • ElastiCache Redis.
  • Redis-backed MCP tool state.
  • Health endpoint proof showing the active session backend.
  • Stateless Streamable HTTP mode.
  • Client-generated logical Mcp-Session-Id.
  • Evidence capture from AWS Console, CloudWatch, ECS, ALB, and test-client logs.

Key Discovery

The first Redis attempt still failed.

Redis moved our application/tool state out of the task, but FastMCP's stateful Streamable HTTP transport still kept active MCP transport sessions in process memory. When traffic moved to a new ECS task, the new task did not recognize the old Mcp-Session-Id and returned:

Session not found

That error happened before our tool handler ran, which means Redis-backed tool state was necessary but not sufficient.

The working solution was:

MCP_STATELESS_HTTP=true
+ Redis-backed logical session state
+ client-provided Mcp-Session-Id

Final Architecture

The live AWS test used these services:

  • Amazon ECR to store the Docker image.
  • AWS CloudFormation to create and delete infrastructure.
  • Amazon VPC with public and private subnets.
  • Application Load Balancer to receive HTTP traffic.
  • Amazon ECS on AWS Fargate to run the MCP server containers.
  • Amazon ElastiCache for Redis to store logical MCP session data outside the tasks.
  • Amazon CloudWatch Logs to capture application logs.

GitHub Markdown does not render the official AWS icon pack directly unless those icons are added as image assets. To keep this README portable, the architecture below uses AWS service names in a Mermaid diagram that renders directly on GitHub.

flowchart LR
    client["MCP Test Client"]
    ecr["Amazon ECR\nmcp-fargate-server image"]
    cfn["AWS CloudFormation\nmcp-infrastructure + mcp-ecs"]
    alb["Application Load Balancer\npublic subnets"]
    tg["ALB Target Group\nsticky sessions enabled"]
    ecs["Amazon ECS Service\nmcp-fargate-service"]
    taskA["AWS Fargate Task A\nFastMCP server\n/mcp + /health"]
    taskB["AWS Fargate Task B\nFastMCP server\n/mcp + /health"]
    redis["Amazon ElastiCache Redis\nexternal logical session store"]
    logs["Amazon CloudWatch Logs\ncontainer logs"]

    client -->|"HTTP /mcp and /health"| alb
    alb --> tg
    tg --> taskA
    tg --> taskB
    ecs --> taskA
    ecs --> taskB
    taskA -->|"read/write Mcp-Session-Id state"| redis
    taskB -->|"read/write Mcp-Session-Id state"| redis
    taskA --> logs
    taskB --> logs
    ecr -->|"container image"| ecs
    cfn -->|"creates"| alb
    cfn -->|"creates"| ecs
    cfn -->|"creates"| redis

Network Layout

flowchart TB
    internet["Internet / local test client"]

    subgraph vpc["AWS VPC"]
        subgraph public["Public subnets"]
            alb["Application Load Balancer"]
            nat["NAT Gateway"]
        end

        subgraph private["Private subnets"]
            task1["Fargate task\nFastMCP container"]
            task2["Fargate task\nFastMCP container"]
            redis["ElastiCache Redis"]
        end
    end

    internet --> alb
    alb --> task1
    alb --> task2
    task1 --> redis
    task2 --> redis
    task1 --> nat
    task2 --> nat

Security group intent:

  • ALB accepts HTTP from the client.
  • ECS tasks accept traffic only from the ALB security group.
  • Redis accepts traffic only from the ECS task security group.
  • Redis is not public.

Request Flow After The Fix

sequenceDiagram
    participant Client as MCP client
    participant ALB as Application Load Balancer
    participant TaskA as Fargate task A
    participant TaskB as Fargate task B
    participant Redis as ElastiCache Redis

    Client->>ALB: initialize /mcp
    ALB->>TaskA: route request
    TaskA-->>Client: stateless response
    Client->>ALB: call set_session_value with Mcp-Session-Id
    ALB->>TaskA: route request
    TaskA->>Redis: write logical session state
    Redis-->>TaskA: OK
    TaskA-->>Client: tool result
    Note over TaskA,TaskB: ECS force new deployment starts
    ALB->>TaskB: later request routed to new task
    TaskB->>Redis: read same logical Mcp-Session-Id state
    Redis-->>TaskB: session data
    TaskB-->>Client: HTTP 200 tool result

ECS Force Deployment Flow

flowchart TD
    start["Service stable\n2 desired tasks running"]
    force["Force new deployment"]
    newTask["ECS starts replacement task"]
    healthy["New task passes ALB health checks"]
    drain["Old task enters target group draining"]
    traffic["ALB routes traffic to healthy task"]
    redisState["New task reads logical session from Redis"]
    success["Client continues with HTTP 200\nNo Session not found errors"]

    start --> force
    force --> newTask
    newTask --> healthy
    healthy --> drain
    drain --> traffic
    traffic --> redisState
    redisState --> success

Simplified topology:

Client
  -> Application Load Balancer
  -> ECS Fargate service
       -> Task A: FastMCP server
       -> Task B: FastMCP server
  -> ElastiCache Redis

Redis runs outside the Fargate task. That placement is what lets session data outlive any single task.

Do not run Redis as a sidecar inside the same Fargate task for this use case. A sidecar Redis container dies with the task and does not solve deployment replacement.

Final Result

During the successful force-deployment test:

{
  "http_status_counts": {
    "200": 141
  },
  "unique_task_ids": [
    "eb83d8d37aa448758abe33e410d17864",
    "8454af6040484b64b252adf5d0448fff"
  ],
  "first_task_id": "eb83d8d37aa448758abe33e410d17864",
  "last_task_id": "8454af6040484b64b252adf5d0448fff",
  "max_state_size": 70,
  "session_not_found_count": 0,
  "error_rows": 0,
  "session_complete": true
}

This proves:

  • The client crossed from one ECS task to another.
  • The session continued after task replacement.
  • Redis state for the session grew to 70 keys (max_state_size).
  • All 141 recorded MCP requests returned HTTP 200.
  • There were zero Session not found errors.
  • The session completed during ECS deployment replacement.
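
A summary of this shape can be derived from the client's JSONL log. The sketch below is an assumption about how that could be done, not the repository's actual parser: the per-row field names (status, task_id, state_size, error) are hypothetical, and session_complete is left out because its definition is not given here.

```python
import json
from collections import Counter


def summarize(jsonl_text: str) -> dict:
    """Aggregate one client run (one JSON object per line) into a summary."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    task_ids = [row["task_id"] for row in rows if row.get("task_id")]
    errors = [row for row in rows if row.get("error")]
    return {
        "http_status_counts": dict(Counter(str(row["status"]) for row in rows)),
        "unique_task_ids": list(dict.fromkeys(task_ids)),  # first-seen order
        "first_task_id": task_ids[0] if task_ids else None,
        "last_task_id": task_ids[-1] if task_ids else None,
        "max_state_size": max((row.get("state_size", 0) for row in rows), default=0),
        "session_not_found_count": sum(
            "Session not found" in str(row["error"]) for row in errors
        ),
        "error_rows": len(errors),
    }


# Three made-up rows in which the client crosses from one task to another.
sample = "\n".join(
    json.dumps(row)
    for row in [
        {"status": 200, "task_id": "task-a", "state_size": 1, "error": None},
        {"status": 200, "task_id": "task-a", "state_size": 2, "error": None},
        {"status": 200, "task_id": "task-b", "state_size": 3, "error": None},
    ]
)
summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

The useful signal is the combination: more than one task ID, a growing state size, and zero error rows in the same run.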

What Changed In The Code

Redis And Memory Session Stores

src/session_store.py now contains:

  • SessionStore: common interface.
  • InMemorySessionStore: local/demo backend.
  • RedisSessionStore: shared backend for ECS tasks.
  • create_session_store(): selects backend from environment.

Backend selection:

REDIS_URL set     -> RedisSessionStore
REDIS_URL missing -> InMemorySessionStore
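
The selection rule can be sketched as follows. This is not the contents of src/session_store.py: the method names, the mcp:session: key prefix, and the hash-per-session layout are assumptions. The Redis calls used (Redis.from_url, hset, expire, hgetall) are standard redis-py methods, and the import is deferred so the in-memory path runs without the package installed.

```python
import os
from typing import Mapping, Protocol


class SessionStore(Protocol):
    def set_value(self, session_id: str, key: str, value: str) -> None: ...
    def get_state(self, session_id: str) -> dict[str, str]: ...


class InMemorySessionStore:
    name = "memory"

    def __init__(self) -> None:
        self._data: dict[str, dict[str, str]] = {}

    def set_value(self, session_id: str, key: str, value: str) -> None:
        self._data.setdefault(session_id, {})[key] = value

    def get_state(self, session_id: str) -> dict[str, str]:
        return dict(self._data.get(session_id, {}))


class RedisSessionStore:
    name = "redis"

    def __init__(self, url: str, ttl_seconds: int) -> None:
        import redis  # deferred: only needed when REDIS_URL is set

        self._client = redis.Redis.from_url(url, decode_responses=True)
        self._ttl = ttl_seconds

    def _key(self, session_id: str) -> str:
        return f"mcp:session:{session_id}"  # hypothetical key layout

    def set_value(self, session_id: str, key: str, value: str) -> None:
        self._client.hset(self._key(session_id), key, value)
        self._client.expire(self._key(session_id), self._ttl)  # refresh TTL on write

    def get_state(self, session_id: str) -> dict[str, str]:
        return self._client.hgetall(self._key(session_id))


def create_session_store(env: Mapping[str, str] = os.environ) -> SessionStore:
    """Pick the backend from the environment, as described above."""
    url = env.get("REDIS_URL")
    if url:
        return RedisSessionStore(url, int(env.get("SESSION_TTL_SECONDS", "86400")))
    return InMemorySessionStore()


store = create_session_store({})  # no REDIS_URL, so the memory backend is chosen
store.set_value("client-example", "color", "blue")
print(store.name, store.get_state("client-example"))
```

Keeping both backends behind one interface means the tools do not change between local runs and ECS; only the environment does.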

Stateless Streamable HTTP

src/server.py reads:

MCP_STATELESS_HTTP=true

When enabled, FastMCP starts with:

mcp.http_app(stateless_http=True)

This avoids depending on a per-task in-memory Streamable HTTP transport session registry.

Logical MCP Session ID

In stateless mode, the server does not issue a transport session ID. The test client generates a stable logical session ID:

client-<uuid>

It sends this value on every tool call:

Mcp-Session-Id: client-...

FastMCP exposes that header through ctx.session_id, and our tools use it as the Redis key.
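
On the client side, the important property is that the ID is generated once and reused. A minimal sketch, assuming a small helper class (LogicalSession is a hypothetical name, not taken from test_client.py):

```python
import uuid


def new_logical_session_id() -> str:
    """Generate an ID in the client-<uuid> form described above."""
    return f"client-{uuid.uuid4()}"


class LogicalSession:
    """Holds one stable Mcp-Session-Id and stamps it on every request."""

    def __init__(self) -> None:
        self.session_id = new_logical_session_id()

    def headers(self) -> dict[str, str]:
        return {
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
            "Mcp-Session-Id": self.session_id,
        }


session = LogicalSession()
first_call = session.headers()
second_call = session.headers()
print(first_call["Mcp-Session-Id"])
```

Because the ID never comes from a server response, it stays valid no matter which task answers the next request.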

Lazy Session Creation

set_session_value() creates the logical Redis session on first write if the key does not exist.

get_session_state() still fails for a never-seen session, so a read never silently invents an empty session.
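
The asymmetry between writes and reads can be sketched as below. This is an illustration under assumptions: a plain dict stands in for Redis, the session ID is passed as an argument rather than read from ctx.session_id, and the SessionNotFound exception and argument names are hypothetical.

```python
sessions: dict[str, dict[str, str]] = {}  # stand-in for the Redis-backed store


class SessionNotFound(LookupError):
    """Raised when a read targets a session that was never written."""


def set_session_value(session_id: str, key: str, value: str) -> int:
    """Write a value, creating the logical session on first use."""
    state = sessions.setdefault(session_id, {})
    state[key] = value
    return len(state)


def get_session_state(session_id: str) -> dict[str, str]:
    """Return a copy of the session state; unknown sessions are an error."""
    if session_id not in sessions:
        raise SessionNotFound(session_id)
    return dict(sessions[session_id])


size = set_session_value("client-example", "color", "blue")
print(size, get_session_state("client-example"))
```

Creating on write lets any task accept the first call for a client-generated ID, while failing on read keeps a mistyped or expired ID visible instead of returning an empty result.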

Health Endpoint

/health now returns the active backend:

{
  "status": "healthy",
  "active_sessions": 0,
  "session_store": "redis"
}

AWS Infrastructure

sam/infrastructure.yaml creates:

  • VPC
  • Public subnets
  • Private subnets
  • NAT gateway
  • Application Load Balancer
  • ALB target group
  • ALB sticky sessions
  • ECR repository
  • CloudWatch log groups
  • ElastiCache Redis
  • Redis security group allowing inbound traffic only from ECS tasks

sam/ecs.yaml creates:

  • ECS cluster
  • ECS task execution role
  • Fargate task definition
  • ECS service
  • ALB service attachment

The container receives:

REDIS_URL=redis://<elasticache-endpoint>:6379/0
SESSION_TTL_SECONDS=86400
MCP_STATELESS_HTTP=true

Repository Layout

.
|-- Dockerfile
|-- Makefile
|-- README.md
|-- evidence/
|   |-- README.md
|   |-- screenshots/
|   |-- 08-health.json
|   |-- 12-client-during-force-deploy.jsonl
|   |-- 21-client-stateless-force-deploy.jsonl
|   |-- 25-stateless-client-summary.json
|   `-- 28-cloudwatch-tail.txt
|-- sam/
|   |-- infrastructure.yaml
|   `-- ecs.yaml
|-- src/
|   |-- server.py
|   |-- session_store.py
|   |-- tools.py
|   |-- health.py
|   `-- shutdown.py
`-- test_client/
    `-- test_client.py

Prerequisites

  • AWS CLI configured with permissions for ECS, ECR, CloudFormation, EC2, ELB, CloudWatch Logs, IAM, and ElastiCache.
  • Docker Desktop or Docker Engine.
  • Python 3.12+.
  • AWS region used in this test: ap-south-1.

SAM is optional. The Makefile uses sam deploy, but this test was also run with direct aws cloudformation deploy.

Deploy

1. Deploy Infrastructure

make deploy-infra

AWS CLI equivalent:

aws cloudformation deploy \
  --region ap-south-1 \
  --template-file sam/infrastructure.yaml \
  --stack-name mcp-infrastructure \
  --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
  --no-fail-on-empty-changeset

2. Build And Push Image

make build IMAGE_TAG=redis

3. Deploy ECS

make deploy-ecs IMAGE_TAG=redis

4. Verify Health

curl http://<ALB_DNS_NAME>/health

Expected:

{
  "status": "healthy",
  "session_store": "redis"
}
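
The same check can be scripted so a deployment fails fast when the memory backend is active by mistake. This is a sketch using only the Python standard library; the function names are assumptions, and the network call is defined but not executed here.

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def check_health_payload(payload: dict) -> None:
    """Raise if the service is unhealthy or not using the Redis backend."""
    if payload.get("status") != "healthy":
        raise RuntimeError(f"unexpected status: {payload.get('status')!r}")
    if payload.get("session_store") != "redis":
        raise RuntimeError(f"unexpected backend: {payload.get('session_store')!r}")


# Validate the expected payload shown above without touching the network.
expected = {"status": "healthy", "session_store": "redis"}
check_health_payload(expected)
print("health payload OK")
```

Against a live stack, this would be called as check_health_payload(fetch_health("http://<ALB_DNS_NAME>")).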

Run The Force Deployment Test

Start the client:

python test_client/test_client.py \
  --endpoint http://<ALB_DNS_NAME>/mcp \
  --calls 70 \
  --delay 2

While the client is running, force a new ECS deployment:

aws ecs update-service \
  --region ap-south-1 \
  --cluster mcp-fargate-cluster \
  --service mcp-fargate-service \
  --force-new-deployment

Wait for service stability:

aws ecs wait services-stable \
  --region ap-south-1 \
  --cluster mcp-fargate-cluster \
  --services mcp-fargate-service
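
For repeated runs, the two commands above can be driven from Python. This is a convenience sketch, not part of the repository: the function names are assumptions, and it only wraps the same AWS CLI invocations via subprocess. The example builds the command lines without running them.

```python
import subprocess


def force_deploy_commands(region: str, cluster: str, service: str) -> list[list[str]]:
    """Return the update-service and wait commands as argv lists."""
    common = ["--region", region, "--cluster", cluster]
    return [
        ["aws", "ecs", "update-service", *common,
         "--service", service, "--force-new-deployment"],
        ["aws", "ecs", "wait", "services-stable", *common,
         "--services", service],
    ]


def run_force_deploy(region: str, cluster: str, service: str) -> None:
    """Trigger the deployment, then block until the service is stable."""
    for argv in force_deploy_commands(region, cluster, service):
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure


commands = force_deploy_commands(
    "ap-south-1", "mcp-fargate-cluster", "mcp-fargate-service"
)
for argv in commands:
    print(" ".join(argv))
```

Passing argv lists instead of a shell string avoids quoting issues, and check=True stops the script if the deployment request itself is rejected.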

Then inspect:

evidence/21-client-stateless-force-deploy.jsonl
evidence/25-stateless-client-summary.json

Evidence

Evidence is included in evidence/.

Important files:

  • 08-health.json: live /health endpoint showing Redis mode.
  • 12-client-during-force-deploy.jsonl: Redis-only attempt that still hit transport-level session failure.
  • 21-client-stateless-force-deploy.jsonl: final successful stateless+Redis run.
  • 25-stateless-client-summary.json: parsed success summary.
  • 28-cloudwatch-tail.txt: ECS task logs from CloudWatch.
  • screenshots/: AWS Console screenshots.

See evidence/README.md for a detailed evidence map.

AWS Console Evidence Captured

The screenshots show:

  • ECR image pushed.
  • ECS task definition revision.
  • ECS service healthy before deployment test.
  • ECS logs from task.
  • Force new deployment menu.
  • Deployment in progress.
  • ALB target group draining old target.
  • ECR image after rebuild.
  • Revision 2 deployment in progress.
  • Revision 2 deployment success.
  • CloudFormation teardown in progress.


1. ECR image pushed

ECR repository with initial Redis image

This proves the container image was pushed to Amazon ECR before ECS deployment.

2. ECS task definition revision 1

ECS task definition revision 1

This proves the Fargate task definition was created and active.

3. ECS service healthy before deployment

ECS service healthy on revision 1

This proves the ECS service was running and healthy before forcing a new deployment.

4. ECS task logs

ECS task logs showing application startup and health checks

This proves the FastMCP container started successfully and was receiving ALB health checks.

5. Force new deployment action

ECS force new deployment menu

This proves the deployment replacement test was triggered from the ECS console.

6. Deployment in progress with pending task

ECS deployment in progress with one pending task

This proves ECS started a replacement task while the service was still serving traffic.

7. Deployment in progress with three running tasks

ECS deployment in progress with three running tasks

This proves old and new tasks overlapped during the rolling deployment.

8. Target group draining old task

ALB target group with old target draining

This proves the old Fargate task entered draining while new healthy targets were available.

9. ECR images after rebuild

ECR repository with two images after rebuild

This proves a second image was pushed for the stateless HTTP fix.

10. Stateless deployment in progress

Stateless deployment in progress on revision 2

This proves the fixed revision was rolled out through ECS.

11. Stateless deployment success

Stateless deployment success with two healthy targets

This proves the final ECS deployment reached success with healthy targets.

12. CloudFormation teardown

CloudFormation stack delete in progress

This proves the test resources were cleaned up after evidence capture.

Teardown

Delete ECS first:

aws cloudformation delete-stack \
  --region ap-south-1 \
  --stack-name mcp-ecs

Then delete infrastructure:

aws cloudformation delete-stack \
  --region ap-south-1 \
  --stack-name mcp-infrastructure

If CloudFormation cannot delete ECR because images still exist:

aws ecr list-images \
  --region ap-south-1 \
  --repository-name mcp-fargate-server

aws ecr batch-delete-image \
  --region ap-south-1 \
  --repository-name mcp-fargate-server \
  --image-ids imageDigest=<digest>

In the captured test run, teardown completed after deleting the remaining ECR images.

Production Notes

  • Use ElastiCache with Multi-AZ or MemoryDB for stronger production durability.
  • Enable encryption in transit and Redis authentication for production.
  • Put Redis in private subnets.
  • Allow Redis inbound traffic only from the ECS task security group.
  • Do not rely on ALB sticky sessions as your durability layer.
  • Keep MCP tool operations idempotent where possible.
  • For server-sent event resumability, consider external event storage as a separate concern.

Main Lesson

For MCP on ECS Fargate, there are two different kinds of state:

  1. Application/tool state.
  2. MCP transport/session-manager state.

Moving only application state to Redis can still fail if the transport session manager is stateful and in memory.

This repository demonstrates a practical ECS-safe pattern:

stateless Streamable HTTP + external Redis logical session state
