Triage MCP Server
Enables AI agents to autonomously check, diagnose, and recover Dockerized services through safe, tool-based ops without direct host shell access.
🩺 Triage: a self-healing ops MCP for any Dockerized service
Let an AI agent (or a human) check, diagnose, and recover a service, without a host shell.
Most "give the agent ops powers" setups are bad: you either hand the model a raw shell (now it can roam the whole box and conflate unrelated subsystems), or you wire up dashboards a model can't read. Triage is the third option:
A small MCP server that exposes a handful of health/diagnose/recover tools. Each returns raw evidence AND a plain-English translation, a suggested action, and whether the fix is safe to auto-apply. The agent acts through tools; it never touches the host directly.
The policy that makes it safe
| Class | Tools | Behaviour |
|---|---|---|
| Auto-fix safe | `triage_restart_process`, `triage_recover` | An agent may run these on its own and report after. Infra only; no data touched. |
| Ask before risky | `triage_apply(confirm=true)` | Anything that could lose data or change external state. Dry-run unless `confirm=true`. |
| Can't self-fix | (reported) | Diagnosed and handed to the human with exact steps; never faked. |
The dual raw + layman output is the differentiator: the agent gets structured data to act on, and the human gets a sentence they can actually understand ("Postiz's API engine isn't running: the known cold-boot hiccup. I'll restart it.").
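As a rough illustration of the dual output, the sketch below models one diagnosed issue. The field names `raw`, `layman`, `action`, and `can_auto_fix` follow the terms used in this README; the `Issue` class, the `diagnose_process` helper, and the shape of `raw` are assumptions made for the example, not the server's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    raw: dict           # structured evidence the agent can branch on
    layman: str         # one sentence the human can read
    action: str         # suggested next step
    can_auto_fix: bool  # True only for infra-only fixes that touch no data


def diagnose_process(label: str, name: str, running: bool) -> Optional[Issue]:
    """Hypothetical helper: turn one process check into an Issue, or None if healthy."""
    if running:
        return None
    return Issue(
        raw={"process": name, "status": "stopped"},
        layman=f"{label}'s {name} process isn't running. I'll restart it.",
        action=f"triage_restart_process({name!r})",
        can_auto_fix=True,
    )
```

An agent would branch on `can_auto_fix` and `raw`, then relay `layman` to the human unchanged.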
Tools
| Tool | Kind | What it does |
|---|---|---|
| `triage_health()` | read | Containers + configured in-container processes + optional dependency ping. |
| `triage_diagnose()` | read | Health check matched to a runbook → issues with raw + plain-English + action + `can_auto_fix`. |
| `triage_logs(lines)` | read | Raw service log tail. |
| `triage_restart_process(name)` | safe | Restart one in-container process (pm2). |
| `triage_recover()` | safe | Recreate the service container from compose. No volumes/data touched. |
| `triage_apply(confirm)` | risky | Dry-run by default; runs the configured risky command only on `confirm=true`. |
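The dry-run gate on `triage_apply` could look roughly like the sketch below. It is a minimal illustration under assumptions: the `apply_risky` function, its return keys, and the injectable `runner` parameter are hypothetical and not taken from `triage.py`.

```python
import shlex
import subprocess


def apply_risky(command: str, description: str, confirm: bool = False,
                runner=subprocess.run) -> dict:
    """Hypothetical gate: describe the action unless confirm is True."""
    if not command:
        return {"ran": False, "layman": "No risky command is configured."}
    if not confirm:
        # Dry-run: nothing is executed, the human sees what would happen.
        return {
            "ran": False,
            "layman": f"Would {description}. Call again with confirm=true to apply.",
        }
    # Confirmed: run the configured command without a shell.
    result = runner(shlex.split(command), capture_output=True, text=True)
    return {
        "ran": True,
        "returncode": result.returncode,
        "raw": result.stdout + result.stderr,
        "layman": f"Ran: {description} (exit code {result.returncode}).",
    }
```

Because the gate is a function argument rather than a sentence in a prompt, a call without `confirm=true` cannot reach the command at all.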
Configure (zero code changes)
Everything is env-driven. Point it at any compose-managed service:
TRIAGE_COMPOSE=/path/to/docker-compose.yaml # compose file
TRIAGE_SERVICE=app # the main container/service name
TRIAGE_LABEL="My App" # friendly name used in messages
TRIAGE_PROCS=backend,worker # optional: in-container processes to watch
TRIAGE_PROC_MGR=pm2 # "pm2" | "none"
TRIAGE_DB_PING="docker exec app-db pg_isready" # optional: rc 0 = dependency healthy
TRIAGE_RISKY_CMD="" # optional: a guarded recovery (clear a queue, etc.)
TRIAGE_RISKY_DESC="clear the stuck job queue"
TRIAGE_PORT=9500
See .env.example.
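For orientation, here is one way these variables might be read at startup. This is a sketch, not the project's loader: the `TriageConfig` class, the `load_config` function, the choice to treat `TRIAGE_COMPOSE` and `TRIAGE_SERVICE` as required, and the fallback values are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class TriageConfig:
    compose: str
    service: str
    label: str
    procs: List[str]
    proc_mgr: str
    db_ping: str
    risky_cmd: str
    risky_desc: str
    port: int


def load_config(env: Optional[Mapping[str, str]] = None) -> TriageConfig:
    """Hypothetical loader; required keys and defaults are assumptions."""
    env = os.environ if env is None else env
    missing = [k for k in ("TRIAGE_COMPOSE", "TRIAGE_SERVICE") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    proc_mgr = env.get("TRIAGE_PROC_MGR", "none")
    if proc_mgr not in ("pm2", "none"):
        raise ValueError(f"TRIAGE_PROC_MGR must be 'pm2' or 'none', got {proc_mgr!r}")
    # TRIAGE_PROCS is a comma-separated list; blanks are ignored.
    procs = [p.strip() for p in env.get("TRIAGE_PROCS", "").split(",") if p.strip()]
    return TriageConfig(
        compose=env["TRIAGE_COMPOSE"],
        service=env["TRIAGE_SERVICE"],
        label=env.get("TRIAGE_LABEL") or env["TRIAGE_SERVICE"],
        procs=procs,
        proc_mgr=proc_mgr,
        db_ping=env.get("TRIAGE_DB_PING", ""),
        risky_cmd=env.get("TRIAGE_RISKY_CMD", ""),
        risky_desc=env.get("TRIAGE_RISKY_DESC", ""),
        port=int(env.get("TRIAGE_PORT", "9500")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.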
Run
pip install -r requirements.txt
python3 triage.py # serves an MCP over streamable-http on TRIAGE_PORT
Register it with your agent runtime (any MCP client). For an always-on host service, use the included launchd template com.triage.ops.plist (macOS), or adapt it to systemd on Linux.
Hard-won lessons baked in
- Agents in a container can't see host processes. Give them status tools, not a shell. With shell access a model conflates unrelated subsystems and reports false negatives. Tools keep it honest.
- Two reports, always. Structured `raw` for the agent to branch on; a one-sentence `layman` for the human. A health check the human can't read is half a tool.
- Encode the safe/risky boundary in the tool, not the prompt. "Don't clear the queue without asking" in a system prompt is a suggestion; a `confirm=true`-gated dry-run is a guarantee.
- `docker compose ps --format json` varies by version (NDJSON vs single array). Handle both.
- Recover ≠ restart. A dead process needs a restart; an unhealthy container needs a recreate. Separate tools so the agent escalates correctly.
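The `docker compose ps --format json` point can be handled with a small normalizer. The sketch below is one possible approach, not the code in `triage.py`; `parse_compose_ps` is a hypothetical name, and the sample keys in the usage note are illustrative.

```python
import json
from typing import List


def parse_compose_ps(output: str) -> List[dict]:
    """Accept either a single JSON array or NDJSON (one object per line)."""
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        # Array form: the whole output is one JSON document.
        return json.loads(text)
    # NDJSON form: each non-empty line is its own JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Both `'[{"Service": "app", "State": "running"}]'` and the same object printed one per line come back as a list of dicts, so the health check only has to deal with one shape.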
Built by
Built by KodeKing · author Fazal Shah. We build local, private, multi-agent AI systems for teams who can't send their data to the cloud. Issues and PRs welcome.
License
MIT. See LICENSE.