Stateful MCP Server on ECS Fargate
Enables MCP sessions to survive ECS Fargate deployments by using Redis-backed tool state and stateless Streamable HTTP transport.
Stateful MCP Servers on ECS Fargate
This repository is an end-to-end practical test of a production question:
Can a stateful MCP server survive ECS Fargate force deployments?
The final answer from the live test is:
Yes, but Redis-backed tool state alone is not enough. The MCP Streamable HTTP transport session registry must also stop depending on one container's memory. In this implementation we use stateless Streamable HTTP plus Redis-backed logical session state.
This project was based on the experiment from:
https://github.com/AvinashDalvi89/stateful-mcp-on-ecs-fargate-example
We extended it, deployed it on AWS, reproduced the problem, fixed it, force deployed again, captured AWS Console evidence, and then tore the resources down.
What Is MCP?
MCP stands for Model Context Protocol.
It is a protocol that lets an AI client call external tools, read resources, and interact with systems in a structured way. Instead of hard-coding every integration inside the model or client, MCP gives the client a standard way to talk to an MCP server.
In this project:
- The MCP client is the Python test client in test_client/test_client.py.
- The MCP server is the FastMCP application running inside ECS Fargate.
- The MCP tools are Python functions exposed from src/tools.py.
- The MCP transport is Streamable HTTP at /mcp.
- The logical session state is the per-client data written and read by the MCP tools.
Simple flow:
MCP client -> HTTP /mcp -> FastMCP server -> MCP tool -> session store
The important point is that MCP can have state at more than one layer. The tool can have state, but the MCP HTTP transport can also have its own session handling. That is exactly where the ECS Fargate deployment problem appeared.
Stateful And Stateless In This Project
Stateful
Stateful means the server remembers something from earlier requests.
In this project, the original stateful behavior had two kinds of state:
- Application/tool state
  - Example: values written by set_session_value().
  - This is the data the MCP tools need across requests.
  - In the original version, this lived in container memory.
- MCP transport session state
  - Example: the Streamable HTTP session known by the FastMCP session manager.
  - This is internal transport-level state used before the tool handler runs.
  - In the default stateful Streamable HTTP mode, this also lived in container memory.
That works on one container. It becomes risky on ECS Fargate because Fargate tasks are temporary.
When ECS replaces a task:
Old task memory disappears
New task has no memory of old transport sessions
Client may be routed to the new task
Default stateful Streamable HTTP can return "Session not found"
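The failure sequence above can be reproduced in a toy model. This is a minimal sketch, not FastMCP code: the `Task` class and its `initialize` and `handle` methods are hypothetical names that only imitate a per-task, in-memory session registry.

```python
from __future__ import annotations

import uuid


class Task:
    """Toy stand-in for one Fargate task with an in-memory session registry."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: set[str] = set()  # lives only in this task's memory

    def initialize(self) -> str:
        # The task that handles initialize remembers the new session ID.
        session_id = uuid.uuid4().hex
        self.sessions.add(session_id)
        return session_id

    def handle(self, session_id: str) -> tuple[bool, str]:
        # A task can only serve sessions found in its own registry.
        if session_id not in self.sessions:
            return False, "Session not found"
        return True, f"handled by {self.name}"


old_task = Task("task-a")
session_id = old_task.initialize()
print(old_task.handle(session_id))  # served by the task that created the session

new_task = Task("task-b")           # replacement task starts with empty memory
print(new_task.handle(session_id))  # the old session is unknown here
```

The second call fails even though the client did nothing wrong: the session only ever existed in the memory of the task that was replaced.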
Stateless
Stateless means a single container does not need to remember transport session data between requests.
In our final working version:
- FastMCP runs with stateless_http=True.
- The server does not depend on one task's in-memory Streamable HTTP session registry.
- The client sends a stable logical Mcp-Session-Id.
- Tool state is stored in external Redis, not inside the Fargate task.
So the container can be replaced, but the session data still exists outside the container.
Old ECS task dies
New ECS task receives request
Client sends same logical Mcp-Session-Id
New task reads logical session state from Redis
Request succeeds
Why Redis Alone Was Not Enough
Redis solved only the application/tool state problem.
The first Redis-backed attempt still failed because the default FastMCP Streamable HTTP transport session registry was still in memory inside each task. When the ALB sent the client to a new task, the new task did not know the old MCP transport session and returned:
Session not found
That happened before our Redis-backed tool code could help.
The fix was to combine both:
Stateless MCP Streamable HTTP
+ external Redis logical session state
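The ordering of the two layers is what made the Redis-only attempt fail, and it can be shown with a small simulation. This is a sketch under stated assumptions: the `Task` class, its methods, and the `run` helper are hypothetical, and a plain dict stands in for Redis. It is not how FastMCP is implemented internally.

```python
from __future__ import annotations


class Task:
    """Toy task: a transport-level check runs before the tool handler."""

    def __init__(self, shared_store: dict, stateless: bool) -> None:
        self.shared_store = shared_store           # stands in for external Redis
        self.stateless = stateless
        self.transport_sessions: set[str] = set()  # per-task memory

    def open_transport_session(self, session_id: str) -> None:
        self.transport_sessions.add(session_id)

    def call_set_value(self, session_id: str, key: str, value: str) -> str:
        # Layer 1: transport session check (skipped in stateless mode).
        if not self.stateless and session_id not in self.transport_sessions:
            return "Session not found"
        # Layer 2: tool handler, reached only if the transport check passes.
        self.shared_store.setdefault(session_id, {})[key] = value
        return "ok"


def run(stateless: bool) -> tuple[str, dict]:
    store: dict = {}
    task_a = Task(store, stateless)
    task_a.open_transport_session("client-123")
    task_a.call_set_value("client-123", "step", "1")
    task_b = Task(store, stateless)  # replacement task, same shared store
    result = task_b.call_set_value("client-123", "step", "2")
    return result, store


print(run(stateless=False))  # rejected before the shared store is touched
print(run(stateless=True))   # tool handler runs and updates the shared state
```

In the stateful run the shared store still holds the value written by the first task, but the replacement task never gets far enough to read or update it.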
The Problem
ECS Fargate tasks are disposable. During a force deployment or rolling deployment:
- ECS starts new replacement tasks.
- The new tasks register with the ALB target group.
- Old tasks are deregistered and enter draining.
- ECS sends SIGTERM to old containers.
- Old container memory disappears.
If an MCP server keeps session data only in process memory, the session can break when the client is routed to a new task.
ALB sticky sessions can delay this problem, but they do not solve it. When a target is draining, unhealthy, or removed, the ALB can route the client to a different task.
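A toy routing rule makes the sticky-session limitation concrete. This sketch is an illustration only: the `route` function is hypothetical and does not reproduce the ALB's actual routing algorithm. It only models the point that stickiness holds while the original target is still eligible.

```python
from __future__ import annotations


def route(sticky_target: str | None, healthy_targets: list[str]) -> str:
    """Toy routing rule: honor stickiness only while the target is healthy."""
    if not healthy_targets:
        raise ValueError("no healthy targets available")
    if sticky_target in healthy_targets:
        return sticky_target
    # Sticky target is draining, unhealthy, or gone: fall back to another one.
    return healthy_targets[0]


print(route("task-a", ["task-a", "task-b"]))  # stickiness holds
print(route("task-a", ["task-b", "task-c"]))  # task-a was replaced, client moves
```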
What We Tested
The original experiment used:
- FastMCP server on ECS Fargate.
- Streamable HTTP endpoint at /mcp.
- ALB with sticky sessions.
- ECS rolling deployment.
- In-memory session store.
- A client that repeatedly calls MCP tools.
We added:
- ElastiCache Redis.
- Redis-backed MCP tool state.
- Health endpoint proof showing the active session backend.
- Stateless Streamable HTTP mode.
- Client-generated logical Mcp-Session-Id.
- Evidence capture from AWS Console, CloudWatch, ECS, ALB, and test-client logs.
Key Discovery
The first Redis attempt still failed.
Redis moved our application/tool state out of the task, but FastMCP's stateful Streamable HTTP transport still kept active MCP transport sessions in process memory. When traffic moved to a new ECS task, the new task did not recognize the old Mcp-Session-Id and returned:
Session not found
That error happened before our tool handler ran, which means Redis-backed tool state was necessary but not sufficient.
The working solution was:
MCP_STATELESS_HTTP=true
+ Redis-backed logical session state
+ client-provided Mcp-Session-Id
Final Architecture
The live AWS test used these services:
- Amazon ECR to store the Docker image.
- AWS CloudFormation to create and delete infrastructure.
- Amazon VPC with public and private subnets.
- Application Load Balancer to receive HTTP traffic.
- Amazon ECS on AWS Fargate to run the MCP server containers.
- Amazon ElastiCache for Redis to store logical MCP session data outside the tasks.
- Amazon CloudWatch Logs to capture application logs.
GitHub Markdown does not render the official AWS icon pack directly unless those icons are added as image assets. To keep this README portable, the architecture below uses AWS service names in a Mermaid diagram that renders directly on GitHub.
flowchart LR
client["MCP Test Client"]
ecr["Amazon ECR\nmcp-fargate-server image"]
cfn["AWS CloudFormation\nmcp-infrastructure + mcp-ecs"]
alb["Application Load Balancer\npublic subnets"]
tg["ALB Target Group\nsticky sessions enabled"]
ecs["Amazon ECS Service\nmcp-fargate-service"]
taskA["AWS Fargate Task A\nFastMCP server\n/mcp + /health"]
taskB["AWS Fargate Task B\nFastMCP server\n/mcp + /health"]
redis["Amazon ElastiCache Redis\nexternal logical session store"]
logs["Amazon CloudWatch Logs\ncontainer logs"]
client -->|"HTTP /mcp and /health"| alb
alb --> tg
tg --> taskA
tg --> taskB
ecs --> taskA
ecs --> taskB
taskA -->|"read/write Mcp-Session-Id state"| redis
taskB -->|"read/write Mcp-Session-Id state"| redis
taskA --> logs
taskB --> logs
ecr -->|"container image"| ecs
cfn -->|"creates"| alb
cfn -->|"creates"| ecs
cfn -->|"creates"| redis
Network Layout
flowchart TB
internet["Internet / local test client"]
subgraph vpc["AWS VPC"]
subgraph public["Public subnets"]
alb["Application Load Balancer"]
nat["NAT Gateway"]
end
subgraph private["Private subnets"]
task1["Fargate task\nFastMCP container"]
task2["Fargate task\nFastMCP container"]
redis["ElastiCache Redis"]
end
end
internet --> alb
alb --> task1
alb --> task2
task1 --> redis
task2 --> redis
task1 --> nat
task2 --> nat
Security group intent:
- ALB accepts HTTP from the client.
- ECS tasks accept traffic only from the ALB security group.
- Redis accepts traffic only from the ECS task security group.
- Redis is not public.
Request Flow After The Fix
sequenceDiagram
participant Client as MCP client
participant ALB as Application Load Balancer
participant TaskA as Fargate task A
participant TaskB as Fargate task B
participant Redis as ElastiCache Redis
Client->>ALB: initialize /mcp
ALB->>TaskA: route request
TaskA-->>Client: stateless response
Client->>ALB: call set_session_value with Mcp-Session-Id
ALB->>TaskA: route request
TaskA->>Redis: write logical session state
Redis-->>TaskA: OK
TaskA-->>Client: tool result
Note over TaskA,TaskB: ECS force new deployment starts
ALB->>TaskB: later request routed to new task
TaskB->>Redis: read same logical Mcp-Session-Id state
Redis-->>TaskB: session data
TaskB-->>Client: HTTP 200 tool result
ECS Force Deployment Flow
flowchart TD
start["Service stable\n2 desired tasks running"]
force["Force new deployment"]
newTask["ECS starts replacement task"]
healthy["New task passes ALB health checks"]
drain["Old task enters target group draining"]
traffic["ALB routes traffic to healthy task"]
redisState["New task reads logical session from Redis"]
success["Client continues with HTTP 200\nNo Session not found errors"]
start --> force
force --> newTask
newTask --> healthy
healthy --> drain
drain --> traffic
traffic --> redisState
redisState --> success
Client
-> Application Load Balancer
-> ECS Fargate service
-> Task A: FastMCP server
-> Task B: FastMCP server
-> ElastiCache Redis
Redis is outside the Fargate task. This is important.
Do not run Redis as a sidecar inside the same Fargate task for this use case. A sidecar Redis container dies with the task and does not solve deployment replacement.
Final Result
During the successful force-deployment test:
{
"http_status_counts": {
"200": 141
},
"unique_task_ids": [
"eb83d8d37aa448758abe33e410d17864",
"8454af6040484b64b252adf5d0448fff"
],
"first_task_id": "eb83d8d37aa448758abe33e410d17864",
"last_task_id": "8454af6040484b64b252adf5d0448fff",
"max_state_size": 70,
"session_not_found_count": 0,
"error_rows": 0,
"session_complete": true
}
This proves:
- The client crossed from one ECS task to another.
- The session continued after task replacement.
- Redis state accumulated up to 70 keys.
- Every MCP request returned HTTP 200.
- There were zero Session not found errors.
- The session completed during ECS deployment replacement.
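A summary of this shape could be derived from a JSONL client log roughly as follows. This is a sketch, not the repository's parser: the row field names `http_status`, `task_id`, `state_size`, and `error` are assumptions about the log format, and `session_complete` is left out because the source does not say how it is determined.

```python
from __future__ import annotations

import json
from collections import Counter


def summarize(lines: list[str]) -> dict:
    """Build a summary from JSONL rows; the row field names are assumptions."""
    rows = [json.loads(line) for line in lines if line.strip()]
    task_ids = [row["task_id"] for row in rows if row.get("task_id")]
    errors = [row for row in rows if row.get("error")]
    return {
        "http_status_counts": dict(Counter(str(row["http_status"]) for row in rows)),
        "unique_task_ids": list(dict.fromkeys(task_ids)),  # keep first-seen order
        "first_task_id": task_ids[0] if task_ids else None,
        "last_task_id": task_ids[-1] if task_ids else None,
        "max_state_size": max((row.get("state_size", 0) for row in rows), default=0),
        "session_not_found_count": sum(
            "Session not found" in str(row["error"]) for row in errors
        ),
        "error_rows": len(errors),
    }


# Two made-up rows, one per task, to exercise the function.
sample = [
    '{"http_status": 200, "task_id": "task-a", "state_size": 1, "error": null}',
    '{"http_status": 200, "task_id": "task-b", "state_size": 2, "error": null}',
]
print(json.dumps(summarize(sample), indent=2))
```

Two distinct task IDs together with a zero `session_not_found_count` is the combination that indicates a session crossed a task replacement without breaking.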
What Changed In The Code
Redis And Memory Session Stores
src/session_store.py now contains:
- SessionStore: common interface.
- InMemorySessionStore: local/demo backend.
- RedisSessionStore: shared backend for ECS tasks.
- create_session_store(): selects backend from environment.
Backend selection:
REDIS_URL set -> RedisSessionStore
REDIS_URL missing -> InMemorySessionStore
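The selection rule above might look roughly like this. It is a minimal sketch that reuses the class and function names listed above, but the bodies, the `name` attributes, and the `env` parameter are assumptions; the actual code in src/session_store.py may differ. The Redis class assumes the redis-py package and imports it lazily so the in-memory path runs without it.

```python
from __future__ import annotations

import os
from typing import Mapping


class InMemorySessionStore:
    """Local/demo backend: state lives in this process only."""

    name = "memory"  # hypothetical label, used here only for illustration

    def __init__(self) -> None:
        self.data: dict[str, dict[str, str]] = {}


class RedisSessionStore:
    """Shared backend: state lives in Redis, outside any single task."""

    name = "redis"

    def __init__(self, url: str) -> None:
        import redis  # lazy import: only needed when REDIS_URL is set

        self.client = redis.Redis.from_url(url, decode_responses=True)


def create_session_store(env: Mapping[str, str] = os.environ):
    """Pick the backend from REDIS_URL, mirroring the rule above."""
    url = env.get("REDIS_URL")
    if url:
        return RedisSessionStore(url)
    return InMemorySessionStore()


store = create_session_store({})  # no REDIS_URL, so the in-memory backend is used
print(store.name)
```

Passing the environment as a parameter is a design choice for this sketch: it keeps the selection logic testable without mutating `os.environ`.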
Stateless Streamable HTTP
src/server.py reads:
MCP_STATELESS_HTTP=true
When enabled, FastMCP starts with:
mcp.http_app(stateless_http=True)
This avoids depending on a per-task in-memory Streamable HTTP transport session registry.
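Reading the flag from the environment could be done along these lines. This is a sketch only: the `env_flag` helper and the set of accepted truthy strings are assumptions, not the parsing used in src/server.py.

```python
from __future__ import annotations

from typing import Mapping

# Strings treated as "enabled"; this set is an assumption for illustration.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


stateless = env_flag({"MCP_STATELESS_HTTP": "true"}, "MCP_STATELESS_HTTP")
print(stateless)
# The result would then be passed along as mcp.http_app(stateless_http=stateless).
```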
Logical MCP Session ID
In stateless mode, the server does not issue a transport session ID. The test client generates a stable logical session ID:
client-<uuid>
It sends this value on every tool call:
Mcp-Session-Id: client-...
FastMCP exposes that header through ctx.session_id, and our tools use it as the Redis key.
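On the client side, generating the ID once and attaching it to every request could look like this. The helper names `new_logical_session_id` and `session_headers` are hypothetical; only the `client-<uuid>` format and the `Mcp-Session-Id` header name come from the description above.

```python
from __future__ import annotations

import uuid


def new_logical_session_id() -> str:
    """Create one stable ID per logical client session."""
    return f"client-{uuid.uuid4()}"


def session_headers(session_id: str) -> dict[str, str]:
    """Headers to attach to every tool call for this session."""
    return {"Mcp-Session-Id": session_id}


# Generate the ID once and reuse it, so every task sees the same key.
session_id = new_logical_session_id()
first_call = session_headers(session_id)
later_call = session_headers(session_id)
print(first_call == later_call)
```

The ID must be created once per logical session rather than per request; regenerating it would point each call at a different Redis key.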
Lazy Session Creation
set_session_value() creates the logical Redis session on first write if the key does not exist.
get_session_state() still fails for a never-seen session, which keeps reads honest.
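The write-creates, read-fails behavior can be sketched with an in-memory stand-in. The `LogicalSessions` class and the `SessionNotFound` exception are hypothetical; the real tools work against the Redis-backed store, and the return values here are assumptions.

```python
from __future__ import annotations


class SessionNotFound(KeyError):
    """Raised when a read targets a session that was never written."""


class LogicalSessions:
    """In-memory stand-in for the logical session store."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict[str, str]] = {}

    def set_session_value(self, session_id: str, key: str, value: str) -> int:
        # Lazy creation: the first write creates the session.
        session = self._sessions.setdefault(session_id, {})
        session[key] = value
        return len(session)

    def get_session_state(self, session_id: str) -> dict[str, str]:
        # Reads never create sessions, so an unknown ID is an error.
        if session_id not in self._sessions:
            raise SessionNotFound(session_id)
        return dict(self._sessions[session_id])


sessions = LogicalSessions()
try:
    sessions.get_session_state("client-never-seen")
    missing_read_failed = False
except SessionNotFound:
    missing_read_failed = True

size = sessions.set_session_value("client-123", "step", "1")
print(missing_read_failed, size, sessions.get_session_state("client-123"))
```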
Health Endpoint
/health now returns the active backend:
{
"status": "healthy",
"active_sessions": 0,
"session_store": "redis"
}
AWS Infrastructure
sam/infrastructure.yaml creates:
- VPC
- Public subnets
- Private subnets
- NAT gateway
- Application Load Balancer
- ALB target group
- ALB sticky sessions
- ECR repository
- CloudWatch log groups
- ElastiCache Redis
- Redis security group allowing inbound traffic only from ECS tasks
sam/ecs.yaml creates:
- ECS cluster
- ECS task execution role
- Fargate task definition
- ECS service
- ALB service attachment
The container receives:
REDIS_URL=redis://<elasticache-endpoint>:6379/0
SESSION_TTL_SECONDS=86400
MCP_STATELESS_HTTP=true
Repository Layout
.
|-- Dockerfile
|-- Makefile
|-- README.md
|-- evidence/
| |-- README.md
| |-- screenshots/
| |-- 08-health.json
| |-- 12-client-during-force-deploy.jsonl
| |-- 21-client-stateless-force-deploy.jsonl
| |-- 25-stateless-client-summary.json
| `-- 28-cloudwatch-tail.txt
|-- sam/
| |-- infrastructure.yaml
| `-- ecs.yaml
|-- src/
| |-- server.py
| |-- session_store.py
| |-- tools.py
| |-- health.py
| `-- shutdown.py
`-- test_client/
`-- test_client.py
Prerequisites
- AWS CLI configured with permissions for ECS, ECR, CloudFormation, EC2, ELB, CloudWatch Logs, IAM, and ElastiCache.
- Docker Desktop or Docker Engine.
- Python 3.12+.
- AWS region used in this test: ap-south-1.
SAM is optional. The Makefile uses sam deploy, but this test was also run with direct aws cloudformation deploy.
Deploy
1. Deploy Infrastructure
make deploy-infra
AWS CLI equivalent:
aws cloudformation deploy \
--region ap-south-1 \
--template-file sam/infrastructure.yaml \
--stack-name mcp-infrastructure \
--capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
--no-fail-on-empty-changeset
2. Build And Push Image
make build IMAGE_TAG=redis
3. Deploy ECS
make deploy-ecs IMAGE_TAG=redis
4. Verify Health
curl http://<ALB_DNS_NAME>/health
Expected:
{
"status": "healthy",
"session_store": "redis"
}
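The same check can be scripted. This is a sketch using only the standard library: `check_health` and `fetch_health` are hypothetical helpers, and the base URL passed to `fetch_health` would be the ALB DNS name from your own deployment.

```python
from __future__ import annotations

import json
from urllib.request import urlopen


def check_health(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}, expected 'healthy'")
    if payload.get("session_store") != "redis":
        problems.append(
            f"session_store is {payload.get('session_store')!r}, expected 'redis'"
        )
    return problems


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


# Offline example with the expected payload; no network call is made here.
print(check_health({"status": "healthy", "session_store": "redis"}))
```

Against a live deployment, `check_health(fetch_health("http://<ALB_DNS_NAME>"))` would return an empty list when the Redis backend is active.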
Run The Force Deployment Test
Start the client:
python test_client/test_client.py \
--endpoint http://<ALB_DNS_NAME>/mcp \
--calls 70 \
--delay 2
While the client is running, force a new ECS deployment:
aws ecs update-service \
--region ap-south-1 \
--cluster mcp-fargate-cluster \
--service mcp-fargate-service \
--force-new-deployment
Wait for service stability:
aws ecs wait services-stable \
--region ap-south-1 \
--cluster mcp-fargate-cluster \
--services mcp-fargate-service
Then inspect:
evidence/21-client-stateless-force-deploy.jsonl
evidence/25-stateless-client-summary.json
Evidence
Evidence is included in evidence/.
Important files:
- 08-health.json: live /health endpoint showing Redis mode.
- 12-client-during-force-deploy.jsonl: Redis-only attempt that still hit transport-level session failure.
- 21-client-stateless-force-deploy.jsonl: final successful stateless+Redis run.
- 25-stateless-client-summary.json: parsed success summary.
- 28-cloudwatch-tail.txt: ECS task logs from CloudWatch.
- screenshots/: AWS Console screenshots.
See evidence/README.md for a detailed evidence map.
AWS Console Evidence Captured
The screenshots show:
- ECR image pushed.
- ECS task definition revision.
- ECS service healthy before deployment test.
- ECS logs from task.
- Force new deployment menu.
- Deployment in progress.
- ALB target group draining old target.
- ECR image after rebuild.
- Revision 2 deployment in progress.
- Revision 2 deployment success.
- CloudFormation teardown in progress.
Click any screenshot below to open the full-size evidence image.
1. ECR image pushed
This proves the container image was pushed to Amazon ECR before ECS deployment.
2. ECS task definition revision 1
This proves the Fargate task definition was created and active.
3. ECS service healthy before deployment
This proves the ECS service was running and healthy before forcing a new deployment.
4. ECS task logs
This proves the FastMCP container started successfully and was receiving ALB health checks.
5. Force new deployment action
This proves the deployment replacement test was triggered from the ECS console.
6. Deployment in progress with pending task
This proves ECS started a replacement task while the service was still serving traffic.
7. Deployment in progress with three running tasks
This proves old and new tasks overlapped during the rolling deployment.
8. Target group draining old task
This proves the old Fargate task entered draining while new healthy targets were available.
9. ECR images after rebuild
This proves a second image was pushed for the stateless HTTP fix.
10. Stateless deployment in progress
This proves the fixed revision was rolled out through ECS.
11. Stateless deployment success
This proves the final ECS deployment reached success with healthy targets.
12. CloudFormation teardown
This proves the test resources were cleaned up after evidence capture.
Teardown
Delete ECS first:
aws cloudformation delete-stack \
--region ap-south-1 \
--stack-name mcp-ecs
Then delete infrastructure:
aws cloudformation delete-stack \
--region ap-south-1 \
--stack-name mcp-infrastructure
If CloudFormation cannot delete ECR because images still exist:
aws ecr list-images \
--region ap-south-1 \
--repository-name mcp-fargate-server
aws ecr batch-delete-image \
--region ap-south-1 \
--repository-name mcp-fargate-server \
--image-ids imageDigest=<digest>
In the captured test run, teardown completed after deleting the remaining ECR images.
Production Notes
- Use ElastiCache with Multi-AZ or MemoryDB for stronger production durability.
- Enable encryption in transit and Redis authentication for production.
- Put Redis in private subnets.
- Allow Redis inbound traffic only from the ECS task security group.
- Do not rely on ALB sticky sessions as your durability layer.
- Keep MCP tool operations idempotent where possible.
- For server-sent event resumability, consider external event storage as a separate concern.
Main Lesson
For MCP on ECS Fargate, there are two different kinds of state:
- Application/tool state.
- MCP transport/session-manager state.
Moving only application state to Redis can still fail if the transport session manager is stateful and in memory.
This repository demonstrates a practical ECS-safe pattern:
stateless Streamable HTTP + external Redis logical session state