AllBrain MCP
A multi-agent workflow orchestration server with a global SQLite event store, supporting DAG-based task execution, conflict resolution, and intent extraction.
One Brain. Multiple Agents.
AllBrain MCP captures raw agent events into a global SQLite-backed brain so a new agent can resume project context later.
Implemented core:
- FastMCP stdio server
- Global SQLite store at ~/.allbrain/allbrain.db
- Canonical project identity
- Mandatory session-bound append-only events
- Stable event ordering with UUIDv7 and timestamps
- save_event() and list_events() MCP tools
- Event type registry with unknown event rejection
- Git context tools with safe non-repo behavior
- resume_project() built from raw events plus optional Git context
- Snapshot-backed incremental resume
- Manual create_snapshot() checkpointing
- Weighted auto snapshots
- Snapshot/reducer/compression version checks
- Explicit snapshot delta merge strategy
- Multi-agent event attribution with agent_id, impact_score, caused_by, and branch
- Conflict detection and resolution tools
- Layered multi-agent resume output
- Rule-based semantic intent extraction
- Intent graph and contradiction detection
- Intent-aware resume output
uv run allbrain start --project . --agent codex
Semantic event types:
- goal_set
- task_started
- task_completed
- file_modified
- failure
- task_blocked
Audit events use tool_call. They do not mutate task state, but they are exposed as a secondary tool_usage signal in resume output.
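The registry and audit-event rules above can be sketched roughly as follows. This is a minimal in-memory illustration, not AllBrain's actual store: the EventLog class, the reduce_tasks helper, and the payload shape ({"task": ...}) are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Semantic types listed above, plus the tool_call audit type.
SEMANTIC_TYPES = {"goal_set", "task_started", "task_completed",
                  "file_modified", "failure", "task_blocked"}
AUDIT_TYPES = {"tool_call"}


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for the SQLite event store."""
    events: list = field(default_factory=list)

    def save_event(self, session_id, event_type, payload):
        if not session_id:
            raise ValueError("events must be bound to a session")
        if event_type not in SEMANTIC_TYPES | AUDIT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        event = {"seq": len(self.events), "session_id": session_id,
                 "type": event_type, "payload": payload}
        self.events.append(event)  # append-only: nothing is updated in place
        return event


def reduce_tasks(events):
    """Derive task state; audit events only feed a secondary tool_usage counter."""
    state = {"tasks": {}, "tool_usage": 0}
    status_by_type = {"task_started": "started",
                      "task_completed": "completed",
                      "task_blocked": "blocked"}
    for event in events:
        if event["type"] in AUDIT_TYPES:
            state["tool_usage"] += 1
        elif event["type"] in status_by_type:
            state["tasks"][event["payload"]["task"]] = status_by_type[event["type"]]
    return state
```

Rejecting unknown types at write time keeps the reducer total: every event it sees later is one it knows how to handle.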
Snapshots are derived checkpoints. Raw events remain the only source of truth, and snapshots can be rebuilt from the event stream.
Snapshot metadata stores snapshot_schema_version, reducer_version, and compression_version. Incompatible snapshots are ignored and rebuilt from raw events instead of being trusted silently.
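As a rough sketch of the snapshot rules, the example below resumes from a snapshot plus the event delta when the three versions match, and falls back to a full rebuild otherwise. The toy reducer, the last_seq field, and the version constants are assumptions for illustration only.

```python
# Hypothetical current versions; the real values live in AllBrain itself.
SNAPSHOT_SCHEMA_VERSION = 1
REDUCER_VERSION = 2
COMPRESSION_VERSION = 1


def reduce_events(state, events):
    """Toy reducer: count events per type on top of an existing state."""
    counts = dict(state)
    for event in events:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts


def make_snapshot(events):
    return {
        "snapshot_schema_version": SNAPSHOT_SCHEMA_VERSION,
        "reducer_version": REDUCER_VERSION,
        "compression_version": COMPRESSION_VERSION,
        "last_seq": len(events),  # how many events are folded into the state
        "state": reduce_events({}, events),
    }


def is_compatible(snapshot):
    return (snapshot.get("snapshot_schema_version") == SNAPSHOT_SCHEMA_VERSION
            and snapshot.get("reducer_version") == REDUCER_VERSION
            and snapshot.get("compression_version") == COMPRESSION_VERSION)


def resume(events, snapshot=None):
    """Use snapshot + delta when compatible, otherwise rebuild from raw events."""
    if snapshot is not None and is_compatible(snapshot):
        return reduce_events(snapshot["state"], events[snapshot["last_seq"]:])
    return reduce_events({}, events)
```

The useful property to test is equivalence: resuming from any compatible snapshot should give the same state as replaying every raw event.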
Sprint 4 adds conflict-aware multi-agent context. resume_project() includes global_view, agent_view, conflict_view, decision_view, merged_state, and resolved_conflicts while preserving the legacy top-level fields for compatibility.
Conflict decisions are conservative: low-margin conflicts are marked needs_review, and conflict-aware decision_view.next_step overrides the global resume suggestion.
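One way to read "low-margin" is as the gap between the two best-scoring candidates. The sketch below follows that reading; the scoring inputs, the min_margin default, and the returned dict shape are all assumptions, not the project's actual conflict API.

```python
def resolve_conflict(candidates, min_margin=0.15):
    """Pick the top-scoring candidate, or defer when its lead is too thin.

    candidates maps a candidate label to a numeric score (hypothetical input).
    """
    if not candidates:
        raise ValueError("no candidates to resolve")
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    winner, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = top_score - runner_up_score
    if margin < min_margin:
        # Conservative path: do not auto-resolve a close call.
        return {"status": "needs_review", "winner": None, "margin": margin}
    return {"status": "resolved", "winner": winner, "margin": margin}
```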
Sprint 5 adds deterministic semantic intent tooling. extract_intents(), detect_contradictions(), and resume_with_intent() derive intent context from raw events without LLMs, embeddings, or a vector database.
Intent confidence evolves from supporting evidence, intent lifecycle status tracks active/completed/blocked state, graph edges include an edge_type, and contradictions include a numeric severity_score.
Intent extraction collapses file churn inside an active task into supporting evidence for the main intent, avoids supportive refactor/test false positives, and keeps snapshot+delta intent replay equivalent to full replay.
Sprint 9 introduces the Workflow Engine — the Orchestrator core. This is a foundational change: instead of scheduling tasks atomically, the engine now schedules subtasks within a DAG, handles dependency-aware execution, aggregates multi-agent outputs, and recovers from failures at the node level.
Components:
- TaskGraph with TaskNode and TaskEdge abstractions
- DependencyEngine: DAG validation, cycle detection, topological sort, ready-set calculation, critical path, and blocking reason analysis
- WorkflowStateMachine: PENDING → READY → RUNNING → COMPLETED / FAILED / BLOCKED with validated transitions
- SubtaskScheduler: SchedulerV1 evolution that schedules subtasks, not just tasks, respecting dependency readiness and max-parallel limits
- ResultAggregator: combines Architect/Build/Reviewer outputs with CONCAT, MERGE, VOTE, and SUMMARY strategies
- RecoveryManager: node-level retry with exponential backoff, cascading block for exhausted retries, and workflow resume with completed result replay
- WorkflowEngine: orchestrates the full lifecycle, from creating a workflow from subtasks, through stepping the DAG (processing completions, failures, and scheduling), to running to completion
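A validated state machine of this kind can be sketched with a transition table. The happy path below follows the states named above; the extra edges (for example FAILED back to READY for a retry) are assumptions, and the names NodeState, ALLOWED, and transition are hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    PENDING = "PENDING"
    READY = "READY"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    BLOCKED = "BLOCKED"


# Allowed targets per state. Edges beyond the documented happy path are assumed.
ALLOWED = {
    NodeState.PENDING: {NodeState.READY, NodeState.BLOCKED},
    NodeState.READY: {NodeState.RUNNING, NodeState.BLOCKED},
    NodeState.RUNNING: {NodeState.COMPLETED, NodeState.FAILED, NodeState.BLOCKED},
    NodeState.FAILED: {NodeState.READY, NodeState.BLOCKED},  # assumed retry edge
    NodeState.COMPLETED: set(),
    NodeState.BLOCKED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current.name} -> {target.name}")
    return target
```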
Example: "Implement OAuth Login" decomposes into a DAG over four subtasks: Design API, Implement Backend, Security Review, and Write Tests.
The engine runs this DAG step by step. If node 3 fails, only node 3 retries — the rest of the workflow does not restart.
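Ready-set calculation and cycle detection can be illustrated with the standard library's graphlib module. This is a sketch, not the DependencyEngine itself; the edges in deps are an assumption chosen for illustration, and run_order and has_cycle are hypothetical helper names.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: node -> set of nodes it depends on.
deps = {
    "design_api": set(),
    "implement_backend": {"design_api"},
    "security_review": {"implement_backend"},
    "write_tests": {"implement_backend"},
}


def run_order(graph):
    """Return the batches of nodes that become ready together."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # the current ready set
        batches.append(ready)
        sorter.done(*ready)  # mark completed, which unlocks dependents
    return batches


def has_cycle(graph):
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True
```

Each batch is a set of nodes whose dependencies are all complete, so a scheduler with a max-parallel limit could dispatch up to that many nodes from a batch at once.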
Key design decisions:
- Event-sourced: new semantic event types added (subtask_created, subtask_started, subtask_completed, subtask_failed, workflow_state_changed, retry_scheduled, workflow_created, workflow_started, workflow_completed, workflow_failed, result_aggregated)
- Idempotent recovery: completed nodes are replayed into a resumed workflow via engine.resume()
- Isolated module: allbrain/workflow/ does not mutate existing orchestrator code; integration via orchestrator/workflow_bridge.py is planned for future sprints
- Full test coverage: 30 unit/integration tests covering DAG ops, state machine, scheduling, aggregation, recovery, serialization, and end-to-end workflow execution
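Node-level retry with exponential backoff and a cascading block might look roughly like this. The delay parameters, the max_retries default, and the choice to mark the exhausted node FAILED while its dependents become BLOCKED are assumptions; the RecoveryManager's real policy may differ.

```python
def backoff_delay(attempt, base=1.0, factor=2.0, cap=60.0):
    """Delay before retry number attempt (1-based), capped at cap seconds."""
    return min(cap, base * factor ** (attempt - 1))


def descendants(children, node):
    """All nodes reachable from node by following dependency edges downstream."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def on_failure(states, attempts, children, node, max_retries=3):
    """Schedule a retry for node, or give up and block everything downstream.

    Returns the retry delay in seconds, or None when retries are exhausted.
    """
    attempts[node] = attempts.get(node, 0) + 1
    if attempts[node] <= max_retries:
        states[node] = "READY"  # only this node is re-queued
        return backoff_delay(attempts[node])
    states[node] = "FAILED"
    for blocked in descendants(children, node):
        states[blocked] = "BLOCKED"  # cascading block
    return None
```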
The existing task-level orchestrator (allbrain/orchestrator/) remains fully operational. No regressions introduced (111 of 112 existing tests pass; the one failure is pre-existing in test_agent_profile_scheduler.py).
Sprint 10 introduces the Agent Runtime Layer + Async Executor — moving AllBrain from "plans workflows" to "actually runs agents." This is the first sprint where the system can execute real LLM calls (Claude, OpenAI, Gemini, Qwen, OpenCode CLI, Codex CLI) through a unified adapter contract.
Components:
- AgentDefinition schema: id, name, version, provider, capabilities, cost, latency profile, max context, adapter class, config, safety limits
- AgentRegistry: central registry with auto-discovery from environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, DASHSCOPE_API_KEY, OPENCODE_AVAILABLE, CODEX_AVAILABLE)
- AgentAdapter ABC: provider-agnostic execution contract with execute(), health_check(), estimate_cost()
- SafetyWrapper: input sanitization (prompt injection defense), cost ceiling (per-call + per-workflow), rate limiting, output validation
- ExecutionMetrics: duration, token counts, cost, success/failure, collected per execution
- CapabilityLearner: EMA-based auto-learning from execution metrics; capability scores evolve from observed success rates
- TaskQueue ABC + InMemoryTaskQueue: async FIFO queue, Redis/RabbitMQ-swap-ready
- WorkerPool: N-worker async dispatch with graceful shutdown and in-flight tracking
- AgentRuntime: bridges WorkflowEngine → TaskQueue → WorkerPool → AgentAdapter → SafetyWrapper → MetricsCollector → CapabilityLearner
- MockAdapter: zero-cost adapter for testing without real LLM calls
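The EMA-based learning in CapabilityLearner can be illustrated with a plain exponential moving average over success/failure outcomes. The class name CapabilityScore, the smoothing factor alpha, and the cold-start prior of 0.5 are assumptions for this sketch.

```python
class CapabilityScore:
    """Exponential moving average of observed success for one capability."""

    def __init__(self, alpha=0.2, prior=0.5):
        self.alpha = alpha    # weight given to the newest observation
        self.score = prior    # cold-start value before any evidence
        self.samples = 0

    def update(self, success):
        outcome = 1.0 if success else 0.0
        self.score = self.alpha * outcome + (1 - self.alpha) * self.score
        self.samples += 1
        return self.score
```

A larger alpha reacts faster to recent executions but is noisier; a smaller one converges more slowly and forgets less.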
Execution model (distributed-first, async event-driven):
WorkflowEngine
|
v
AgentRuntime.execute_subtask(assignment)
|
v
SafetyWrapper (sanitize, cost check, rate limit)
|
v
Adapter.execute(task, context) -- runs in thread executor with timeout
|
v
ExecutionMetrics -- recorded + fed to CapabilityLearner
|
v
SubtaskResult -- back to Workflow Engine
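The "thread executor with timeout" step can be sketched with asyncio.to_thread and asyncio.wait_for (Python 3.9+). The helper names and the result dict shape are assumptions; the real runtime returns its own SubtaskResult type.

```python
import asyncio
import time


def blocking_call(seconds, value):
    """Stand-in for a blocking adapter call (hypothetical)."""
    time.sleep(seconds)
    return value


async def run_with_timeout(fn, *args, timeout_s):
    """Run a blocking call in a worker thread, bounded by a timeout."""
    try:
        output = await asyncio.wait_for(asyncio.to_thread(fn, *args),
                                        timeout=timeout_s)
        return {"success": True, "output": output}
    except asyncio.TimeoutError:
        # The worker thread itself is not killed; only the await is abandoned.
        return {"success": False, "error": "timeout"}
    except Exception as exc:
        return {"success": False, "error": str(exc)}
```

Because a timed-out thread keeps running until its call returns, a real adapter would also want its own client-side timeout on the underlying request.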
Key design decisions applied from the Sprint 9 review:
- Event-sourced single source of truth: Workflow state remains a derived view; agent execution events are written to the same event store
- Engine/Scheduler/Runtime boundary clarified: Scheduler decides "who", Engine decides "how + when", Runtime executes "actually run"
- Safety first: every adapter call goes through SafetyWrapper with hard cost ceilings
- Capability auto-learning: metrics from real executions feed back into the scheduling layer
- Distributed-ready queue: TaskQueue ABC allows swapping InMemoryTaskQueue for Redis/RabbitMQ without changing the runtime
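The queue boundary can be sketched as an abstract base class with an asyncio-backed implementation and a small worker pool. The method names on TaskQueue and the run_pool and demo helpers are assumptions; only the class names TaskQueue and InMemoryTaskQueue come from the component list.

```python
import asyncio
from abc import ABC, abstractmethod


class TaskQueue(ABC):
    """Minimal queue contract (method names are hypothetical)."""

    @abstractmethod
    async def put(self, task): ...

    @abstractmethod
    async def get(self): ...

    @abstractmethod
    def task_done(self): ...

    @abstractmethod
    async def join(self): ...


class InMemoryTaskQueue(TaskQueue):
    def __init__(self):
        self._queue = asyncio.Queue()  # FIFO

    async def put(self, task):
        await self._queue.put(task)

    async def get(self):
        return await self._queue.get()

    def task_done(self):
        self._queue.task_done()

    async def join(self):
        await self._queue.join()


async def run_pool(queue, handler, tasks, workers=2):
    """Dispatch tasks to N workers and wait until the queue is drained."""
    results = []

    async def worker():
        while True:
            task = await queue.get()
            try:
                results.append(await handler(task))
            finally:
                queue.task_done()

    pool = [asyncio.create_task(worker()) for _ in range(workers)]
    for task in tasks:
        await queue.put(task)
    await queue.join()  # all in-flight work finished
    for running in pool:
        running.cancel()  # shut workers down once the queue is empty
    await asyncio.gather(*pool, return_exceptions=True)
    return results


async def demo():
    async def double(value):
        return value * 2

    return await run_pool(InMemoryTaskQueue(), double, [1, 2, 3])
```

Since run_pool only touches the TaskQueue interface, a Redis- or RabbitMQ-backed subclass could be passed in without changing the worker loop.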
Adapter slots for future sprints: Claude, OpenAI, Gemini, Qwen, OpenCode CLI, Codex CLI. All share the same AgentAdapter contract.
Test coverage: 41 new tests covering definition serialization, registry, safety (cost ceiling, rate limit, input sanitization, domain allowlist), metrics collection, capability learning (EMA convergence, cold start, latency tracking), queue operations, worker pool lifecycle, runtime execution (success, failure, timeout, unknown agent, batch), and end-to-end workflow + runtime integration.
Full test suite: 182 tests, 181 passing (one pre-existing failure in test_unhealthy_reviewer_is_skipped unrelated to this sprint).
Sprint 33 introduces the World Model Layer — the cognitive shift from "decide then act" to "predict then decide". The system can now ask "what happens if I do this?" before committing, and feeds the answer into the closed-loop learning engine.
Components:
- WorldState, Prediction, SimulationResult: pydantic models with extra="forbid" and bounded numeric fields; Prediction adds a confidence score (0-1) for downstream calibration
- EnvironmentTracker: deterministic WorldState capture
- StateTransitionBridge: immutable model_copy(update=...) transitions; input never mutated
- PredictionBridge: deterministic risk/success/cost/confidence rules (deploy without tests is high risk)
- SimulationBridge: combines transition + prediction, mints a uuid7 simulation_id
- WorldModel facade: pure observe() and simulate(action, state); no event writing at this layer
- WorldStateBuilder: projection from event list to world state (derived view, not in-memory)
- WorldHistory: event-derived query helper for latest_state() and latest_simulation()
Pipeline integration:
- SystemDecisionPipeline.run(...) gains simulate_before_execute: bool = False and risk_threshold: float = 0.7
- When enabled, the pipeline emits world_state_observed and world_simulation_run between final_decision_recorded and the scheduler
- If prediction.risk >= risk_threshold, the runtime state machine transitions to BLOCKED with reason world_simulation_high_risk
- Otherwise the world success_probability overrides execution_plan["predicted_success"] so the closed-loop learning engine compares world model output against the actual outcome
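The gating rule can be sketched as a small pure function. The function name, the dict-based prediction, and the return shape are assumptions; only the threshold comparison, the BLOCKED reason, and the predicted_success override come from the description above.

```python
def gate_on_simulation(execution_plan, prediction, risk_threshold=0.7):
    """Block on high predicted risk, otherwise adopt the world model's estimate."""
    if prediction["risk"] >= risk_threshold:
        return {"blocked": True,
                "reason": "world_simulation_high_risk",
                "execution_plan": execution_plan}
    # Copy rather than mutate, so the caller's plan is left untouched.
    updated = dict(execution_plan,
                   predicted_success=prediction["success_probability"])
    return {"blocked": False, "reason": None, "execution_plan": updated}
```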
New event types:
- world_state_observed: emitted on every observe() call
- world_simulation_run: emitted on every simulate() call with impact_score = prediction.risk
New MCP tools:
- observe_world(project_path, limit): captures a fresh WorldState and emits the event
- simulate_action(action, project_path, limit): captures state, simulates the action, emits both events
Replay equivalence: EventReplayEngine routes world events into a new state["world"] key. WorldStateBuilder is the projection; the world state is fully reconstructable from the event log alone. The replay equivalence test asserts replay(events)["final_state"]["world"] matches WorldStateBuilder().build(events) exactly.
Deferred to future sprints (raised during planning, not in this scope): action.metadata for richer action descriptors, payload_version on world events for migration safety, and a tighter test_replay_simulation_prediction_equivalence beyond the builder-level check.
Test coverage: 11 new tests in tests/test_world.py covering event emission, prediction rules, transition immutability, history round-trip, replay equivalence, MCP impl stability, pipeline simulation gating, and world-to-learning prediction feedback.
Full test suite: 225 tests, 225 passing, no regressions.
The next layer is Sprint 34 — Counterfactual Reasoning: "what if I had not done this?" and "which alternative is best?".
Sprint 34 adds the counterfactual reasoning layer on top of the world model. The system can now ask "what would have happened if I had chosen differently?", compute decision regret, and produce advisory recommendations with severity bands.
Components:
- CounterfactualResult, RankedAlternative: pydantic models with extra="forbid"; improvement = alt.success − actual.success, regret = max(0, improvement)
- recommendation_severity(improvement) returns Literal["low", "medium", "high"] with bands [0.20, 0.40) / [0.40, 0.70) / >= 0.70
- AlternativeGenerator: deterministic ACTION_MAP (deploy → [run_tests, delay_deploy, rollback], delete → [backup, archive])
- CounterfactualEvaluator: stateless compare using SimulationBridge for both actual and alternative
- AlternativeRanker: stateless rank by success_probability − risk
- CounterfactualEngine: facade, analyze(state, action, limit=N) and rank(state, actions)
- CounterfactualProjection: replay projection with analyses, generated, recommendations, unknown_actions, count, unknown_action_count, recommendation_count
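The arithmetic behind these components is small enough to sketch directly. The formulas, bands, and ACTION_MAP entries are taken from the list above; the dict-based alternatives and the choice to return None below 0.20 (where the source does not say what happens) are assumptions.

```python
ACTION_MAP = {
    "deploy": ["run_tests", "delay_deploy", "rollback"],
    "delete": ["backup", "archive"],
}


def generate_alternatives(action):
    """Known actions map to fixed alternatives; unknown ones yield an empty list."""
    return list(ACTION_MAP.get(action, []))


def improvement(alt_success, actual_success):
    return alt_success - actual_success


def regret(improvement_value):
    return max(0.0, improvement_value)


def recommendation_severity(improvement_value):
    if improvement_value >= 0.70:
        return "high"
    if improvement_value >= 0.40:
        return "medium"
    if improvement_value >= 0.20:
        return "low"
    return None  # assumption: below the lowest band nothing is recommended


def rank_alternatives(alternatives):
    """Order alternatives by success_probability minus risk, best first."""
    return sorted(alternatives,
                  key=lambda alt: alt["success_probability"] - alt["risk"],
                  reverse=True)
```

Clamping regret at zero means an alternative that would have done worse than the actual choice contributes no regret rather than a negative one.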
Pipeline integration:
- SystemDecisionPipeline.run(...) gains enable_counterfactual: bool = False, counterfactual_limit: int = Field(ge=1, le=100), regret_threshold: float = Field(ge=0.0, le=1.0)
- Pipeline raises ValueError when counterfactual_limit < 1 (defense in depth alongside the schema validation)
- Runs after the world simulation step, before EXECUTION; the pipeline observes a fresh WorldState on its own (independent of simulate_before_execute)
- R1 advisory only: never overrides final_decision. Continues to EXECUTION regardless. counterfactual_recommendation is emitted only when best.improvement >= regret_threshold
- Learning integration (S1 plain): the prediction dict is enriched with best_alternative and regret before ClosedLoopLearningEngine.evaluate(). The error_delta formula is unchanged
New event types:
- counterfactual_generated: at the start of an analysis. If the action is unknown, payload includes reason: "unknown_action" and an empty alternatives list
- counterfactual_evaluated: once per alternative
- counterfactual_recommendation: only when threshold met, with severity and impact_score = improvement
New MCP tools:
- generate_counterfactual(action, project_path, limit, counterfactual_limit): runs engine.analyze() and writes events
- rank_alternatives(actions, project_path, limit): runs AlternativeRanker.rank() (read-only, no events)
Replay equivalence: EventReplayEngine routes counterfactual_* events into a new state["counterfactual"] key. CounterfactualProjection is the projection. The replay equivalence test asserts replay(events)["final_state"]["counterfactual"] == CounterfactualProjection().build(events) exactly.
Future metrics (Sprint 35+, not implemented in this sprint): average_regret, rolling_regret, high_regret_count (with severity breakdown), unknown_action_rate, regret_by_objective_kind. The data is already in the event log; the evolution/organizational learning layer would consume the projection.
Test coverage: 13 new tests in tests/test_counterfactual.py covering alternative generation, improvement/regret math, ranking, event emission, the unknown_action metric, projection build, replay equivalence, severity bands, pipeline integration (gating, learning integration, validation), and MCP tools.
Full test suite: 238 tests, 238 passing, no regressions.
The next layer is decision quality analytics: aggregating regret history into dashboards and tying the unknown-action metric to action knowledge base expansion.
Sprint 35 adds the scenario planning layer on top of counterfactual reasoning. The system can now ask "what are all the futures that could unfold from this action, and how spread out are they?" by running the same action against four different state overlays.
Components:
- ScenarioResult, ScenarioAnalysis: pydantic models with extra="forbid"; analysis_id: UUID (uuid7) for replay debugging and observability timeline; confidence: float 0-1 from template
- ScenarioTemplate (frozen dataclass): name + environment_state_overlay (additive merge) + environment_state_remove (explicit key removal) + resources_overlay + resources_remove + confidence + description + template_version
- apply_overlay(state, template): immutable state modifier using model_copy(update=...)
- ScenarioGenerator: defaults() returns 4 named templates; from_specs(specs) builds custom ones
- ScenarioEvaluator: stateless, takes a simulator, returns ScenarioResult
- ScenarioRanker: select(results) picks best/worst/safest/expected; metrics(results) computes prediction_spread, risk_volatility, uncertainty, confidence_total
- ScenarioEngine: facade, analyze(state, action, limit=N) and evaluate_custom(state, action, scenarios)
- ScenarioProjection: replay projection that deduplicates analysis_ids via a seen_ids set
Metrics exposed:
- prediction_spread = best.success - worst.success
- risk_volatility = max(risk) - min(risk)
- uncertainty = 1 - sum(confidence * prediction.confidence)
- confidence_total = sum(scenario confidences) (sanity, ~1.0)
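The four metrics can be computed directly from a list of scenario results. This sketch assumes each result is a dict with the keys shown, and that best and worst mean the highest and lowest success_probability; the real ScenarioRanker may select them differently.

```python
def scenario_metrics(results):
    """Compute spread, volatility, uncertainty, and the confidence sanity sum."""
    successes = [r["success_probability"] for r in results]
    risks = [r["risk"] for r in results]
    weighted_confidence = sum(r["confidence"] * r["prediction_confidence"]
                              for r in results)
    return {
        "prediction_spread": max(successes) - min(successes),
        "risk_volatility": max(risks) - min(risks),
        "uncertainty": 1 - weighted_confidence,
        "confidence_total": sum(r["confidence"] for r in results),
    }


# Illustrative inputs: template confidences as documented, other numbers invented.
example_results = [
    {"name": "best_case", "confidence": 0.25, "prediction_confidence": 0.8,
     "success_probability": 0.95, "risk": 0.10},
    {"name": "expected_case", "confidence": 0.50, "prediction_confidence": 0.8,
     "success_probability": 0.70, "risk": 0.30},
    {"name": "worst_case", "confidence": 0.15, "prediction_confidence": 0.8,
     "success_probability": 0.30, "risk": 0.80},
    {"name": "safest_case", "confidence": 0.10, "prediction_confidence": 0.8,
     "success_probability": 0.90, "risk": 0.05},
]
```

With every prediction at confidence 0.8 and template confidences summing to 1.0, the uncertainty comes out near 0.2, which shows how the template weights act as a probability distribution over futures.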
Default templates:
- best_case (confidence 0.25): environment = {tests: passed, deployment: ready}, all resources true
- expected_case (confidence 0.50): no overlay, baseline trajectory
- worst_case (confidence 0.15): environment tests removed, resources = {internet: false, disk: false}
- safest_case (confidence 0.10): environment = {tests: passed, deployment: verified}, all resources true
State overlay semantics (O2): overlay fields merge additively. Removing keys requires an explicit environment_state_remove / resources_remove list. apply_overlay is immutable and never mutates the input state.
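The overlay semantics can be sketched with frozen dataclasses. Note the swap: the project uses pydantic's model_copy(update=...), while this example uses dataclasses.replace as a standard-library stand-in with the same copy-on-write intent. The State and Template field names are assumptions modeled on the ScenarioTemplate description.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class State:
    environment_state: dict
    resources: dict


@dataclass(frozen=True)
class Template:
    name: str
    environment_state_overlay: dict = field(default_factory=dict)
    environment_state_remove: tuple = ()
    resources_overlay: dict = field(default_factory=dict)
    resources_remove: tuple = ()


def _merge(base, overlay, remove):
    merged = {**base, **overlay}   # additive merge: overlay keys win
    for key in remove:
        merged.pop(key, None)      # removal must be requested explicitly
    return merged


def apply_overlay(state, template):
    """Return a new State; the input state and its dicts are left untouched."""
    return replace(
        state,
        environment_state=_merge(state.environment_state,
                                 template.environment_state_overlay,
                                 template.environment_state_remove),
        resources=_merge(state.resources,
                         template.resources_overlay,
                         template.resources_remove),
    )
```

An explicit remove list avoids overloading a value such as None to mean "delete this key", which keeps the additive merge unambiguous.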
Pipeline integration:
- SystemDecisionPipeline.run(...) gains enable_scenarios: bool = False, scenarios_limit: int = Field(ge=1, le=20), scenario_recommendation_threshold: float = Field(ge=0.0, le=1.0)
- Pipeline raises ValueError when scenarios_limit < 1 (defense in depth alongside the schema validation)
- Runs after the counterfactual step, before EXECUTION; the scenario step observes a fresh WorldState on its own (D1 independent)
- R1 advisory: never overrides final_decision. Continues to EXECUTION regardless. scenario_recommended is emitted with rationale every time
- Learning integration: the prediction dict is enriched with prediction_spread, risk_volatility, and uncertainty before ClosedLoopLearningEngine.evaluate()
New event types:
- scenario_generated: payload includes template_version: 1, analysis_id, and the list of actual scenario names evaluated
- scenario_evaluated: one per scenario result, with impact_score = confidence
- scenario_recommended: always emitted (R1) with best_case, expected_case, rationale, and template_version
New MCP tools:
- generate_scenarios(action, project_path, limit, scenarios_limit): runs engine.analyze() and writes events
- evaluate_scenarios(action, scenarios, project_path, limit): runs engine.evaluate_custom() with user-provided scenario dicts; per-scenario events are emitted (not the 4 defaults)
Replay equivalence: EventReplayEngine routes scenario_* events into a new state["scenarios"] key. ScenarioProjection is the projection. The replay equivalence test asserts replay(events)["final_state"]["scenarios"] == ScenarioProjection().build(events) exactly.
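The replay equivalence property, together with the seen_ids deduplication, can be illustrated with a toy projection. The event dict shape, the projection's output keys, and both function names are assumptions; the real ScenarioProjection tracks more than this.

```python
def build_scenario_projection(events):
    """Fold scenario_generated events into a projection, one entry per analysis_id."""
    seen_ids = set()
    analyses = []
    for event in events:
        if event["type"] != "scenario_generated":
            continue
        analysis_id = event["payload"]["analysis_id"]
        if analysis_id in seen_ids:
            continue  # a re-delivered event must not be counted twice
        seen_ids.add(analysis_id)
        analyses.append(event["payload"])
    return {"analyses": analyses, "count": len(analyses)}


def replay(events):
    """Route scenario_* events into state["scenarios"], ignoring everything else."""
    scenario_events = [e for e in events if e["type"].startswith("scenario_")]
    return {"final_state": {"scenarios": build_scenario_projection(scenario_events)}}
```

Deduplicating by analysis_id makes the projection idempotent, so replaying a log that contains a repeated event yields the same state as the clean log.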
Future metrics (Sprint 36+, not implemented in this sprint):
- normalized_spread = prediction_spread / expected_case.success_probability: the same 0.20 spread at expected=0.80 vs expected=0.30 is not the same forecast disagreement
- scenario_accuracy: post-hoc comparison of each scenario's success_probability against the actual_success recorded in RUNTIME_FEEDBACK_RECORDED; belongs to the evolution layer
- analysis_id timeline: across runs, surface how often the same analysis_id correlates with downstream decision_regret to learn whether scenario spread is a leading indicator of regret
- template_version migration tooling when template semantics change
Test coverage: 13 new tests in tests/test_scenarios.py covering default templates, best/worst/safest selection, metrics, overlay remove semantics, event emission, projection dedup, replay equivalence, pipeline integration (output, learning integration, validation), and custom-scenario MCP tool.
Full test suite: 251 tests, 251 passing, no regressions.
The next step is decision quality analytics: aggregating regret history and tying the unknown-action metric to action knowledge base expansion.
Sprint 36 adds the strategic foresight layer on top of multi-future scenarios. The system now asks "which sequence of actions produces the best long-term outcome?" by simulating plans step by step with state chaining.
Components:
- FuturePlan: pydantic model with actions, predicted_success, cumulative_risk, cumulative_cost, horizon, confidence, step_states (debug hook)
- ForesightAnalysis: pydantic model with analysis_id: UUID (uuid7), action, best_plan, safest_plan, fastest_plan, expected_plan, plan_spread, strategy_uncertainty, horizon_risk, template_version=1, plans
- DEPLOY_PLANS: static list of 4 default plans for the deploy action (P1 single list)
- ActionPlanner: generate(action) returns plans for deploy or [] otherwise
- MultiStepSimulator: chains SimulationBridge through each step, returns (final_state, predictions, step_states) (MS1)
- PlanEvaluator: enforces max_horizon (T1 reject) and computes the plan metrics
- PlanRanker: select(plans) picks best/safest/fastest/expected by score predicted_success - cumulative_risk (S1 plain)
- ForesightEngine: facade, analyze(state, action, limit) and evaluate_custom(state, actions)
- ForesightProjection: replay projection with analyses, generated, recommendations, analysis_ids, count, recommendation_count (deduplicated)
Step states debug hook: MultiStepSimulator.simulate(state, actions) returns step_states (initial + N step states), captured in FuturePlan.step_states and serialized to event payload. Makes "which action broke the state" and "which step created drift" obvious.
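State chaining with a step_states trail can be sketched as below. The prediction rules in simulate_step are invented for illustration (they only echo the "deploy without tests is high risk" idea), and the function names are hypothetical; the averaging in cumulative_risk follows the document's note that the current cumulative risk is an average.

```python
def simulate_step(state, action):
    """Toy deterministic rules (hypothetical): returns (new_state, prediction)."""
    new_state = dict(state)  # copy, so the input state is never mutated
    if action == "run_tests":
        new_state["tests"] = "passed"
        return new_state, {"risk": 0.1, "success": 0.95}
    if action == "deploy":
        tested = state.get("tests") == "passed"
        new_state["deployment"] = "done"
        prediction = ({"risk": 0.2, "success": 0.9} if tested
                      else {"risk": 0.8, "success": 0.4})
        return new_state, prediction
    return new_state, {"risk": 0.5, "success": 0.5}


def simulate_plan(state, actions, max_horizon=20):
    """Chain each step's output state into the next step's input."""
    if len(actions) > max_horizon:
        raise ValueError("plan exceeds max_horizon")
    step_states, predictions, current = [state], [], state
    for action in actions:
        current, prediction = simulate_step(current, action)
        step_states.append(current)  # initial state + one entry per step
        predictions.append(prediction)
    return current, predictions, step_states


def cumulative_risk(predictions):
    """Average per-step risk."""
    if not predictions:
        return 0.0
    return sum(p["risk"] for p in predictions) / len(predictions)
```

Because each step sees the state left by the previous one, the same deploy action is predicted very differently depending on whether run_tests came first, and step_states shows exactly where that difference entered.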
Pipeline integration:
- SystemDecisionPipeline.run(...) gains enable_foresight: bool = False, foresight_limit: int = Field(ge=1, le=20), max_horizon: int = Field(ge=1, le=20)
- Pipeline raises ValueError when foresight_limit < 1 or max_horizon < 1 (defense in depth alongside the schema validation)
- Runs after the scenarios step, before EXECUTION; the foresight step observes a fresh WorldState on its own (D1 independent)
- Plans longer than max_horizon raise ValueError (T1 reject)
- R1 advisory: never overrides final_decision. Continues to EXECUTION regardless. foresight_recommended is emitted with rationale every time
- Learning integration: the prediction dict is enriched with future_horizon, strategy_uncertainty, and horizon_risk before ClosedLoopLearningEngine.evaluate()
New event types:
- foresight_generated: payload includes template_version: 1, analysis_id, plans_count, plan_ids
- foresight_evaluated: one per plan, with impact_score = predicted_success
- foresight_recommended: always emitted (R1) with best_plan, expected_plan, rationale, template_version
New MCP tools:
- generate_future_plans(action, project_path, limit, foresight_limit, max_horizon): runs engine.analyze() and writes events
- evaluate_plan(actions, project_path, limit, max_horizon): runs engine.evaluate_custom() on a user-provided plan; max_horizon enforces T1 reject
Replay equivalence: EventReplayEngine routes foresight_* events into a new state["foresight"] key. ForesightProjection is the projection. The replay equivalence test asserts replay(events)["final_state"]["foresight"] == ForesightProjection().build(events) exactly.
Boundary clarity (per the user's mental model):
- counterfactual (Sprint 34): one-step alternative analysis
- scenario (Sprint 35): one-state multi-world analysis
- foresight (Sprint 36): multi-step trajectory analysis
Future metrics (Sprint 37+, not implemented in this sprint):
- horizon_cost: distinct from cumulative_cost; weighted by step position for discounting distant costs
- worst_step_risk: max(p.risk for p in predictions). The current cumulative_risk = average is "soft"; the worst-step view makes catastrophic steps visible
- plan_depth: explicit split between horizon (model capacity) and plan_length (actual plan length)
- plan_regret: best_plan success minus the chosen plan success; belongs to the evolution layer
- Extensible planning templates (P2 dict): currently only deploy is supported
- payload_version migration on world, counterfactual, and scenario events (deferred from Sprint 33 onwards)
Test coverage: 16 new tests in tests/test_foresight.py covering plan generation, best/safest/fastest selection, step states debug hook, horizon metrics, projection build, event emission, replay equivalence, pipeline integration (output, learning integration, validation), max_horizon T1 reject, unknown action sentinel, and custom-plan MCP tool.
Full test suite: 267 tests, 267 passing, no regressions.
The system can now say: "I can deploy now. Running tests first increases success. The best long-term strategy is run_tests → fix_failures → deploy → monitor with predicted success 95%, risk 15%, horizon 4 steps." This is the first time AllBrain thinks in sequences, not just single actions.