Confluence Hybrid RAG MCP Server

Enables hybrid search and agentic retrieval over Confluence pages via MCP tools, allowing Claude Desktop, Cursor, or Claude Code to search and retrieve Confluence content.

Confluence Hybrid RAG + Agentic RAG + MCP + PROD Design

This project implements hybrid agentic RAG over Confluence using Python, bm25s, OpenAI embeddings, RRF, the Cohere reranker, Pydantic-AI, and FastMCP. It ships an MCP server, which makes Confluence searchable from Claude Desktop, Cursor, or Claude Code, and a Streamlit chatbot.

Quick Reference — Choose Your Retrieval Approach
Why Combine the Two Patterns?
Architecture
Retrieval Pipeline
Agentic Loop
MCP Integration
Setup
Testing Without a Confluence Instance
Files
Keeping the Index Fresh
Extending
Production Architecture
Beyond Vector Databases — Better Retrieval Approaches

Quick reference — choose your retrieval approach

Pick your approach based on corpus size and the type of questions your users ask. Details, cost estimates, and AWS implementation for each option are in the Production architecture section below.

By corpus size

Confluence pages	Recommended approach	Storage needed
< 50 000	Contextual BM25 + reranker	OpenSearch only
50 000 – 300 000	Contextual BM25 + reranker	OpenSearch only
300 000 – 1 000 000	Contextual hybrid (BM25 + dense) + reranker	OpenSearch + vector index
> 1 000 000	Contextual hybrid + reranker + GraphRAG	OpenSearch + vector index + Neptune

By question type

Questions your users ask	Best approach
"How do I do X?" — specific how-to	Contextual BM25 + reranker
"What is X?" — definitions, policies	Contextual BM25 + reranker
"Find anything about X" — paraphrased, exploratory	Contextual hybrid (BM25 + dense)
"What changed in X and why?" — history, causality	GraphRAG
"Who owns X and what is its current status?" — ownership chains	GraphRAG
All of the above	Contextual hybrid + GraphRAG

All options at a glance

Approach	Quality	Ops complexity	AWS cost / month (medium scale)	Removes vector DB?
Prototype — BM25 + dense + RRF + rerank	Good	Low	~$1 335	No
Vector DB upgrade — Qdrant / OpenSearch kNN	Good	Medium	~$1 400	No
Contextual BM25 — Claude context prefix + BM25 + rerank	Very good	Low	~$1 000	Yes
Contextual hybrid — context prefix + BM25 + dense + rerank	Excellent	Medium	~$1 200	No
GraphRAG — knowledge graph traversal	Excellent on multi-hop	High	~$1 720	Optional
Contextual hybrid + GraphRAG	Best across all query types	Very high	~$1 900	No

Recommended implementation order

Start with Contextual BM25 — add a Claude context-generation step before indexing, remove dense embeddings. Better quality, lower cost, less infrastructure. No changes to the MCP server, agent, or chatbot.
Add dense embeddings back when you observe users rephrasing the same question multiple ways and getting inconsistent results — the signal that BM25 recall is the bottleneck.
Add GraphRAG only after analysing real user queries for 4–6 weeks and confirming that multi-hop questions (history of a decision, ownership chains, impact of an incident) represent a significant share of traffic.

Key principle: the vector database is a scale concern, not a quality concern. Contextual Retrieval and GraphRAG address the actual quality bottlenecks. Fix quality first, then scale storage to match corpus size.

Why combine the two patterns?

What hybrid retrieval solves

Your company wiki uses different words than your users do. A user asks "How do we handle auth failures?", but the Confluence page is titled "Authentication error handling". BM25 alone misses it because the query and the title share almost no terms. Dense search alone has the opposite weakness: it handles paraphrase, but rare product names and ticket IDs get diluted in the embedding. The BM25 + dense + RRF + rerank pipeline from knowledge/hybrid-retrieval covers both failure modes.

What agentic retrieval solves

Some questions need more than one search: "What changed in the deployment process last quarter and why?" requires reading the current process page, finding the ADR that changed it, and cross-referencing an incident report. A single-shot retrieval can't do that. The agentic loop from knowledge/agentic-rag lets the model search multiple times, follow leads, and synthesise across pages.

What MCP adds

MCP makes the whole thing a first-class tool inside any MCP client. Claude in your IDE or chat interface can search Confluence on its own when it needs internal context — without you having to copy-paste docs into the prompt manually.

Architecture

Confluence REST API
       │
       ▼
1-fetch-confluence.py   pull pages → strip HTML → chunk → chunks/*.json
       │
       ▼
2-build-index.py        BM25 index (bm25s)  +  dense embeddings (OpenAI)
                                │                        │
                                └────────────┬───────────┘
                                             ▼
                               indexes/bm25/  indexes/embeddings.npy
                               indexes/meta.json
                                             │
                        ┌────────────────────┼───────────────────────┐
                        ▼                    ▼                       ▼
               3-hybrid-search.py      4-agent.py           5-mcp-server.py
               (interactive CLI)    (pydantic-ai agent)   (FastMCP → Claude)

Retrieval pipeline (inside every search call)

Query
  │
  ├─► BM25           catches exact terms, product names, ticket IDs
  │                  e.g. "JIRA-4521", "prod-db-01", "rerank-v4.0-fast"
  │
  ├─► Dense          catches paraphrase and synonyms
  │   (cosine)       e.g. "auth failure" ↔ "authentication error"
  │
  ├─► RRF            fuses the two ranked lists without score normalisation
  │                  (RRF sidesteps the problem that BM25 and cosine scores
  │                   live on completely different scales)
  │
  └─► Cohere rerank  cross-encoder that sees query + document jointly
      top-50 in      much higher precision than bi-encoder dense alone
      → top-10 out
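
The RRF stage can be sketched in a few lines. This is a minimal illustration, not the code in utils/retrieval.py: the function name rrf_fuse, the constant k=60 (a commonly used default), and the chunk IDs are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document, so only ranks matter
    and the raw BM25 / cosine scores never need to be normalised.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["chunk_a", "chunk_b", "chunk_c"]    # hypothetical BM25 ranking
dense_hits = ["chunk_b", "chunk_d", "chunk_a"]   # hypothetical dense ranking
fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)
```

Documents that appear in both lists (chunk_b, chunk_a) rise above documents found by only one retriever, which is the behaviour the pipeline relies on before handing the top candidates to the reranker.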

Agentic loop (inside 4-agent.py and every MCP session)

User question
      │
      ▼
 list_spaces          (which Confluence domains are indexed?)
      │
      ▼
 hybrid_search ──────► BM25 + dense + RRF + rerank
      │
      ▼
 snippets relevant?
      ├─ Yes ──► synthesise answer
      └─ No  ──► get_page_full (read a full page)
                 or hybrid_search again with a refined query
                      │
                      ▼
              synthesise answer + citations

MCP integration

The MCP server in 5-mcp-server.py exposes three tools:

Tool	What it does
`list_confluence_spaces`	Returns indexed spaces (key + name)
`search_confluence`	Four-stage hybrid search; returns snippets
`get_confluence_page`	Returns full text of one page by page_id

Claude acts as the agent — it calls these tools in a loop the same way 4-agent.py does internally. No duplicate agent layer on the server.

Connect Claude Desktop

{
  "mcpServers": {
    "confluence": {
      "url": "http://localhost:8051/sse"
    }
  }
}

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

Connect Claude Code

Add .mcp.json in your project root:

{
  "mcpServers": {
    "confluence": {
      "type": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Connect Cursor

Settings → MCP → Add server → URL: http://localhost:8051/sse

Setup

# 1. Install dependencies (from the repo root)
uv sync

# 2. Copy and fill in credentials
cp .env.example .env

# 3. Fetch Confluence pages (run with empty CONFLUENCE_SPACE_KEYS first
#    to see available spaces, then set the ones you want and re-run)
uv run 1-fetch-confluence.py

# 4. Build BM25 + dense indexes (~$0.002 per 1 000 chunks)
uv run 2-build-index.py

# 5. Test retrieval interactively
uv run 3-hybrid-search.py

# 6. Try the full agent (optional)
uv run 4-agent.py "What is our deployment process?"

# 7. Start the MCP server for Claude Desktop / Cursor / Claude Code
uv run 5-mcp-server.py

You need four API keys in .env:

CONFLUENCE_BASE_URL + CONFLUENCE_EMAIL + CONFLUENCE_API_TOKEN
OPENAI_API_KEY — embeddings (text-embedding-3-small)
COHERE_API_KEY — reranker (rerank-v4.0-fast, free tier)
ANTHROPIC_API_KEY — only needed for 4-agent.py

Testing without a Confluence instance

1-fetch-k8s.py is a drop-in replacement for 1-fetch-confluence.py that scrapes the public Kubernetes documentation (kubernetes.io/docs) instead of your Confluence instance. It produces chunks in the exact same JSON format, so every subsequent step (2-build-index.py, 3-hybrid-search.py, 5-mcp-server.py, 6-chatbot.py) runs unchanged. No Confluence credentials are needed; the OpenAI and Cohere keys are still required for embedding and reranking.

Prerequisites

# beautifulsoup4 is the only extra dependency
uv pip install beautifulsoup4

Running the scraper

# Scrape ~190 Kubernetes docs pages (≈ 2 min at 0.5 s/request)
uv run 1-fetch-k8s.py

# Then continue with the normal pipeline
uv run 2-build-index.py
uv run 5-mcp-server.py              # terminal 1
uv run streamlit run 6-chatbot.py   # terminal 2

Expected output:

Fetching sitemap: https://kubernetes.io/en/sitemap.xml
Found 192 pages to index

[  1/192] Concepts                                           3 chunk(s)
[  2/192] Kubernetes Components                              4 chunk(s)
...
Done: 189 pages → 847 chunks  (2 empty, 1 errors)
Chunks saved to: .../chunks/

Configuration

All tunable constants are at the top of 1-fetch-k8s.py:

Constant	Default	What it controls
`INCLUDE_SECTIONS`	`concepts/`, `tasks/`, `tutorials/`, `setup/`, `reference/glossary/`, `reference/kubectl/`	Sections of kubernetes.io/docs to crawl
`SKIP_PATTERNS`	`reference/kubernetes-api/`, `contribute/`	Sub-paths excluded even if they fall under an included section
`REQUEST_DELAY`	`0.5` s	Pause between HTTP requests (respectful crawl rate)
`MAX_CHUNK_CHARS`	`1500`	Maximum characters per chunk
`OVERLAP_CHARS`	`150`	Overlap between consecutive chunks

The large API reference (reference/kubernetes-api/, ~600 pages of spec tables) is excluded by default to keep index size manageable. Add it to INCLUDE_SECTIONS if you need API field lookups.
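
How MAX_CHUNK_CHARS and OVERLAP_CHARS interact can be shown with a small sketch of character-based chunking with fixed overlap. This is an assumption about the general technique, not the chunking code shipped in the repository; the function name chunk_text is hypothetical.

```python
def chunk_text(text, max_chars=1500, overlap=150):
    """Split text into windows of at most max_chars characters,
    each sharing `overlap` characters with the previous window."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(3200))  # 3 200 characters of dummy text
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With the defaults, each new chunk starts 1 350 characters after the previous one, so the last 150 characters of a chunk are repeated at the start of the next.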

Output format

Each chunk is saved as a JSON file in chunks/ and uses space_key: "K8S". The schema is identical to Confluence chunks, so 3-hybrid-search.py and the MCP server treat them the same way:

{
  "chunk_id": "docs-concepts-workloads-pods_c0",
  "page_id": "docs-concepts-workloads-pods",
  "title": "Pods",
  "space_key": "K8S",
  "space_name": "Kubernetes Docs",
  "url": "https://kubernetes.io/docs/concepts/workloads/pods/",
  "text": "...",
  "chunk_idx": 0,
  "last_modified": "Mon, 10 Jun 2024 12:00:00 GMT"
}

Example queries once running

"What is the difference between a Deployment and a StatefulSet?"
"How do I configure resource limits for a Pod?"
"What happens when a node fails?"
"How does the Kubernetes scheduler decide where to place a Pod?"

Files

./
├── 1-fetch-confluence.py   Production: fetch & chunk Confluence pages
├── 1-fetch-k8s.py          Test/demo: scrape public Kubernetes docs
├── 2-build-index.py        Build BM25 + dense indexes
├── 3-hybrid-search.py      Interactive search CLI
├── 4-agent.py              Pydantic-AI agent (standalone)
├── 5-mcp-server.py         FastMCP server for Claude/Cursor/Claude Code
├── 6-chatbot.py            Streamlit chatbot (MCP client + chat UI)
├── pyproject.toml          All dependencies
├── .env.example
└── utils/
    ├── confluence.py        ConfluenceClient + html_to_text + chunk()
    ├── retrieval.py         HybridRetriever (BM25+dense+RRF+rerank)
    └── agent_tools.py       list_spaces / hybrid_search / get_page_full
                             shared by 4-agent.py and 5-mcp-server.py

Keeping the index fresh

Confluence changes. A simple refresh loop:

# Re-fetch changed pages and rebuild indexes (run nightly via cron)
uv run 1-fetch-confluence.py && uv run 2-build-index.py

For incremental updates, add last_modified filtering in iter_pages() to skip pages not changed since the last run (compare against the timestamp stored in indexes/meta.json).
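
A minimal sketch of that filter, assuming each page record carries an ISO 8601 last_modified string with a UTC offset and that the previous sync time is available as a timezone-aware datetime. The function name pages_changed_since and the record layout are hypothetical; dates in the HTTP style shown in the Kubernetes sample chunk would need email.utils.parsedate_to_datetime instead of datetime.fromisoformat.

```python
from datetime import datetime, timezone


def pages_changed_since(pages, last_sync):
    """Return the pages whose last_modified is strictly newer than last_sync."""
    changed = []
    for page in pages:
        modified = datetime.fromisoformat(page["last_modified"])
        if modified > last_sync:
            changed.append(page)
    return changed


last_sync = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
pages = [
    {"page_id": "101", "last_modified": "2024-06-09T08:30:00+00:00"},  # unchanged
    {"page_id": "102", "last_modified": "2024-06-11T09:15:00+00:00"},  # edited since
]
changed_ids = [p["page_id"] for p in pages_changed_since(pages, last_sync)]
print(changed_ids)
```

Only page 102 would be re-fetched and re-chunked; page 101 is skipped.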

Extending

Want to add	Where to change
Filter by Confluence label/ancestor	`ConfluenceClient.iter_pages()` — add `expand=ancestors,metadata.labels`
Parent/child page navigation tool	Add `get_child_pages(page_id)` to `utils/agent_tools.py`
Local reranker (offline)	Swap Cohere in `HybridRetriever` for `BAAI/bge-reranker-v2-m3`
Vector DB instead of numpy	Replace `np.load` in `HybridRetriever` with Qdrant/LanceDB client
Incremental index updates	Track `last_modified` in `meta.json`, skip unchanged pages in step 1
Evaluate retrieval quality	Use `knowledge/hybrid-retrieval/docs/build-your-own-eval.md` as recipe

Production architecture (many GB of Confluence data)

The prototype works well for hundreds of pages but hits hard limits at scale. This section explains what breaks and how to redesign each layer for production.

Why the prototype does not scale

Component	Prototype	Breaks when…
Dense index	`embeddings.npy` loaded fully into RAM	>500 k chunks (~3 GB at 1536 dims, fp32)
BM25 index	`bm25s` rebuilt from scratch every sync	Corpus grows beyond a few hundred MB of text
Sync	Full re-fetch + full re-index	Pages number in the tens of thousands
MCP server	Single-process, no auth	Multiple concurrent users, internal tool exposure
Access control	None — every user sees every page	Any team with Confluence page restrictions
Chunking	Character-based with fixed overlap	Long structured pages (tables, code blocks split badly)
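
The RAM figure in the dense index row follows from simple arithmetic, sketched here so it can be re-run for other corpus sizes or precisions. The function name dense_index_bytes is hypothetical.

```python
def dense_index_bytes(n_chunks, dims=1536, bytes_per_value=4):
    """Memory needed to hold all embeddings as one dense array (fp32 by default)."""
    return n_chunks * dims * bytes_per_value


size_fp32 = dense_index_bytes(500_000)
size_fp16 = dense_index_bytes(500_000, bytes_per_value=2)
print(f"fp32: {size_fp32 / 1e9:.2f} GB, fp16: {size_fp16 / 1e9:.2f} GB")
```

At 500 000 chunks the fp32 array alone is about 3 GB, before the BM25 index, metadata, and Python overhead are counted.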

Recommended production stack

Vector database

The right choice depends on where you run infrastructure. The short version: Qdrant if you want the best hybrid search engine; Amazon OpenSearch Service if you are already on AWS and want to stay AWS-native; Bedrock Knowledge Bases if you want zero ingestion pipeline work.

Option	Cloud	Choose if…	Trade-off
Qdrant Cloud	Any (hosted on AWS/GCP/Azure infra)	Starting fresh, want managed ops	Easiest setup; data in Qdrant's account
Qdrant on EKS/ECS	AWS	Data must stay in your AWS account; already run containers	You manage upgrades and backups
Qdrant self-hosted	On-prem / any VM	Full control, air-gapped environments	Full operational burden
Amazon OpenSearch Service	AWS	Already on AWS, want IAM + CloudWatch native integration	Slightly slower ANN than Qdrant at extreme scale
Amazon Bedrock Knowledge Bases	AWS	Want zero ingestion pipeline; Confluence connector built-in	Less control over chunking, reranking, and MCP integration
Azure AI Search	Azure	Already on Azure	Native Confluence connector; higher cost per query
pgvector (RDS/Aurora)	Any	Small corpus (<1 M chunks), already run PostgreSQL	Simplest ops; ANN slows above ~5 M vectors
Pinecone	Any	—	No native sparse support in standard tier; vendor lock-in

Qdrant is the strongest general-purpose choice because it offers native hybrid search (dense + sparse vectors in a single query with built-in RRF) without requiring application-level result merging. OpenSearch also supports single-query hybrid search, as described under Running on AWS below, which is why it is the main alternative for AWS-native teams.

Sparse vectors — BM42 instead of BM25

In production, replace the bm25s BM25 index with BM42 sparse vectors stored inside Qdrant. BM42 is a neural sparse model (available through Qdrant's FastEmbed library) that produces sparse vectors compatible with Qdrant's sparse index. It is designed to improve on classical BM25 for paraphrased queries while preserving the exact-term matching that dense embeddings miss; Qdrant labels it experimental, so confirm the gain on your own eval set before switching. The result is a single Qdrant collection with both a dense vector field and a sparse vector field, queried together in one round-trip.

Embedding model

Keep text-embedding-3-small as the default (best cost/quality ratio for this use case). Switch to text-embedding-3-large only if your Confluence content is heavily multilingual or filled with specialised technical jargon where the extra embedding capacity measurably improves NDCG on your eval set.

Reranker

Keep Cohere rerank-v4.0-fast for managed convenience. Switch to BAAI/bge-reranker-v2-m3 (self-hosted, ~568 M parameters) if your data governance policy prohibits sending document snippets to an external API.

Chunking

Replace the character-based chunking in utils/confluence.py with semantic chunking using Docling. Docling understands Confluence's HTML structure — it splits at heading boundaries, keeps table rows together, and does not cut mid-sentence. Pair this with a parent-child chunking strategy: embed small chunks (~256 tokens) for high-precision retrieval, but return the parent section (~1 024 tokens) as context to the agent. This gives precise matching without truncating the evidence the model needs to answer well.
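
The parent-child lookup can be sketched as follows, assuming the index stores a mapping from each small chunk to its parent section. The names expand_to_parents, child_to_parent, and parent_text are hypothetical, and the IDs are made up for the example.

```python
def expand_to_parents(ranked_child_ids, child_to_parent, parent_text, limit=3):
    """Map ranked child chunks to their parent sections.

    Parents are returned in the order their best child was ranked,
    and each parent appears only once even if several children matched.
    """
    seen = set()
    sections = []
    for child_id in ranked_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue
        seen.add(parent_id)
        sections.append(parent_text[parent_id])
        if len(sections) == limit:
            break
    return sections


child_to_parent = {"p1_c0": "p1", "p1_c1": "p1", "p2_c0": "p2"}
parent_text = {"p1": "Full text of section 1", "p2": "Full text of section 2"}
ranked = ["p1_c1", "p2_c0", "p1_c0"]  # reranker output, best first
context = expand_to_parents(ranked, child_to_parent, parent_text)
print(context)
```

Two children of section 1 matched, but the agent receives that section only once, followed by section 2.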

Running on AWS

If your company operates on AWS, each layer of the stack maps to a managed AWS service. You have three realistic paths depending on how much control you want over the retrieval pipeline.

Path 1 — Qdrant on AWS (best retrieval quality, your infra)

Run Qdrant inside your own AWS account so data never leaves your perimeter, while keeping Qdrant's native hybrid search.

Qdrant deployment	When to choose
Qdrant Cloud on AWS	Fastest start; Qdrant manages ops; pick the AWS region closest to your app. Data lives in Qdrant's AWS account — check your data residency policy first.
Qdrant on EKS	Already run Kubernetes; use the official Qdrant Helm chart; EBS volumes for persistent storage; IAM for pod-level auth.
Qdrant on ECS Fargate	No Kubernetes; run the official Docker image as a Fargate service; EFS mount for persistence; simpler ops than EKS.

Path 2 — Amazon OpenSearch Service (AWS-native, recommended for most AWS teams)

This is the pragmatic default for AWS. OpenSearch Service is fully managed, stays inside your AWS account, and has supported hybrid search (BM25 + k-NN vector in a single query) since version 2.10. It replaces Qdrant without any change to the MCP server or agent — only utils/retrieval.py changes.

Why choose OpenSearch Service over Qdrant on AWS:

IAM authentication — no separate credentials; attach an IAM role to your workers and MCP server, and OpenSearch accepts them natively.
VPC isolation — deploy into a private VPC subnet; no public endpoint needed.
CloudWatch integration — cluster metrics, slow query logs, and index statistics flow to CloudWatch out of the box.
One fewer vendor — no Qdrant Cloud account, no Qdrant billing, no separate support contract. Everything is under your existing AWS bill.
Familiar to AWS ops teams — most AWS platform teams already know how to run OpenSearch.

The trade-off: OpenSearch is slower per node than Qdrant at very high query volume (Qdrant is Rust, OpenSearch is JVM). For a company-internal Confluence search workload you will not reach that ceiling.
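
As a rough illustration of what the replacement in utils/retrieval.py would send, the sketch below builds the request body for an OpenSearch hybrid query with one lexical and one k-NN sub-query. The field names text and embedding and the function name build_hybrid_query are assumptions; the cluster also needs a search pipeline that normalises and combines the sub-query scores, which is configured separately and not shown here.

```python
import json


def build_hybrid_query(query_text, query_vector, k=50,
                       text_field="text", vector_field="embedding"):
    """Build the body of an OpenSearch `hybrid` query (BM25 match + k-NN)."""
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical sub-query, scored with BM25.
                    {"match": {text_field: {"query": query_text}}},
                    # Vector sub-query against the k-NN field.
                    {"knn": {vector_field: {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }


body = build_hybrid_query("auth failure handling", [0.12, -0.03, 0.88], k=5)
print(json.dumps(body, indent=2))
```

The returned candidates would then go to the reranker exactly as in the prototype, so the MCP tools keep their current signatures.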

Path 3 — Amazon Bedrock Knowledge Bases (fully managed, zero pipeline)

If engineering time is the bottleneck, Bedrock Knowledge Bases can eliminate most of the ingestion pipeline. It has a native Confluence data source connector — you supply OAuth credentials and a space list, and Bedrock handles fetching, chunking, embedding (Amazon Titan or third-party models), and indexing into either OpenSearch Serverless or Aurora pgvector.

What you keep: the MCP server and the Streamlit chatbot. What you replace: 1-fetch-confluence.py, 2-build-index.py, utils/retrieval.py, and utils/confluence.py. The search_confluence tool in the MCP server calls the Bedrock Retrieve API instead of Qdrant directly.

Trade-offs:

Less control over chunk size, overlap, and chunking strategy.
Reranking must go through Bedrock's own reranker; Cohere is not directly pluggable.
Bedrock Agents (not your MCP server) handles the agentic loop if you use RetrieveAndGenerate. If you want to keep the MCP pattern, call only the Retrieve API and drive the loop from your FastMCP server as today.
Cold-start latency on OpenSearch Serverless can be high for infrequent queries.

AWS services mapping

Role	AWS service
Confluence change events	Confluence webhooks → API Gateway → SQS
Ingestion workers	ECS Fargate (auto-scaling) or EKS
Vector store (AWS-native)	Amazon OpenSearch Service (hybrid BM25 + k-NN)
Vector store (Qdrant path)	Qdrant on EKS or ECS Fargate
Fully managed RAG	Amazon Bedrock Knowledge Bases + Confluence connector
Query result cache	ElastiCache for Redis (TTL 15 min)
MCP server hosting	ECS Fargate behind an ALB
TLS termination	ALB (ACM certificate, no Nginx needed)
DDoS + WAF	AWS WAF on the ALB
API key / secret storage	AWS Secrets Manager (rotate without redeploy)
Container images	ECR (Elastic Container Registry)
Monitoring + alerts	CloudWatch metrics, alarms, and dashboards
Embedding API	OpenAI (via internet) or Amazon Bedrock hosted models
Reranker API	Cohere (via internet) or Bedrock reranker

AWS architecture diagram

Confluence Cloud
      │
      │  Webhooks (page_created / updated / deleted)
      ▼
┌──────────────┐    ┌─────────────────────────────────────────┐
│ API Gateway  │───►│  Amazon SQS                              │
│ (webhook     │    │  — buffers change events, decouples      │
│  endpoint)   │    │    Confluence rate limits from workers   │
└──────────────┘    └──────────────────┬──────────────────────┘
                                       │
                                       ▼
                        ┌─────────────────────────────┐
                        │  ECS Fargate — Ingestion     │
                        │  Workers (auto-scaling)      │
                        │                              │
                        │  1. Fetch page (Confluence   │
                        │     REST API)                │
                        │  2. Chunk (Docling)          │
                        │  3. Embed (OpenAI / Bedrock) │
                        │  4. Upsert to vector store   │
                        └──────────────┬──────────────┘
                                       │
                    ┌──────────────────┼────────────────────┐
                    │                  │                     │
                    ▼                  ▼                     ▼
         ┌──────────────────┐  ┌─────────────┐  ┌────────────────────┐
         │ Amazon OpenSearch │  │  Qdrant on  │  │ Bedrock Knowledge  │
         │ Service          │  │  EKS / ECS  │  │ Bases              │
         │ (BM25 + k-NN     │  │  (dense +   │  │ (managed, native   │
         │  hybrid, IAM)    │  │   sparse)   │  │  Confluence sync)  │
         └────────┬─────────┘  └──────┬──────┘  └─────────┬──────────┘
                  └──────────────────┬┘                    │
                                     │                     │
                                     ▼                     │
                        ┌────────────────────┐             │
                        │  ECS Fargate        │             │
                        │  MCP Server         │◄────────────┘
                        │  (FastMCP)          │
                        │  + Secrets Manager  │  ┌──────────────────┐
                        │    for API keys     │◄►│ ElastiCache Redis │
                        └────────┬────────────┘  │ query cache       │
                                 │               │ TTL 15 min        │
                                 │               └──────────────────┘
                                 ▼
                    ┌────────────────────────┐
                    │  ALB (HTTPS, ACM cert) │
                    │  + AWS WAF             │
                    └────────────┬───────────┘
                                 │  HTTPS MCP (SSE)
                                 ▼
              Claude Desktop / Cursor / Claude Code
              Streamlit chatbot (ECS Fargate or EC2)

AWS decision guide

Your situation	Recommended path
Starting from scratch on AWS, want best retrieval quality	Qdrant Cloud on AWS (fastest) → migrate to EKS when you need data residency
Data must stay in your AWS account, run Kubernetes	Qdrant on EKS
Data must stay in your AWS account, no Kubernetes	Amazon OpenSearch Service
AWS platform team, want IAM + CloudWatch native	Amazon OpenSearch Service
Engineering time is the bottleneck, want it working this week	Amazon Bedrock Knowledge Bases
Already run pgvector / RDS Aurora, corpus < 1 M chunks	pgvector — no new service needed

Production architecture diagram (generic)

Confluence Cloud
      │
      │  Webhooks (page_created / page_updated / page_deleted)
      │  or scheduled polling every 15 min via REST API (lastModified filter)
      ▼
┌─────────────────────┐
│   Message Queue      │   Redis Streams / AWS SQS / RabbitMQ
│   change events      │   — decouples ingestion speed from Confluence rate limits
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Ingestion Worker   │   One or more worker processes (Celery / ARQ / plain threads)
│                      │   1. Fetch changed page from Confluence REST API
│   fetch → chunk      │   2. Semantic chunk with Docling
│   embed → upsert     │   3. Embed with text-embedding-3-small (batch)
│                      │   4. Upsert dense + sparse vectors into Qdrant
└──────────┬──────────┘   5. Delete Qdrant points for removed pages
           │
           ▼
┌─────────────────────────────────────────────────────┐
│                     Qdrant Collection                │
│                                                      │
│  point_id: chunk_id                                  │
│  dense vector:  text-embedding-3-small (1 536 dims)  │
│  sparse vector: BM42                                 │
│  payload:  { page_id, title, space_key, url,         │
│              last_modified, labels[], ancestor_ids[] }│
│                                                      │
│  Payload indexes on: space_key, labels, last_modified│
└──────────┬──────────────────────────────────────────┘
           │
           │  Hybrid query (dense + sparse + RRF) with payload filter
           │  → top-50 candidates
           │  → Cohere rerank → top-10
           ▼
┌─────────────────────┐        ┌───────────────────────┐
│   MCP Server         │        │   Redis Cache          │
│   (FastMCP)          │◄──────►│   query → results      │
│                      │        │   TTL: 15 min          │
│   + API key auth     │        └───────────────────────┘
│   + rate limiting    │
│   + request logging  │
└──────────┬──────────┘
           │  MCP protocol (SSE or Streamable HTTP)
           ▼
  Claude Desktop / Cursor / Claude Code / Streamlit chatbot

Access control

This is the most important production concern that the prototype ignores entirely. Confluence has page-level and space-level permissions. Without access control, any user of the chatbot can retrieve content from restricted pages they would not normally be allowed to read.

There are three approaches, ordered by accuracy vs. implementation cost:

Option A: Space-level filtering (simplest, coarse-grained)
Index only pages from spaces the chatbot is authorised to see. At query time, filter Qdrant results to the spaces the requesting user's Confluence group can access. Works well when spaces map cleanly to team boundaries. Leaks nothing as long as restricted content is in its own space.

Option B: Permission index (recommended for most companies)
When ingesting each page, call the Confluence REST API (GET /wiki/rest/api/content/{id}/restriction) to fetch which users and groups can read it. Store those groups in the Qdrant payload alongside each chunk. At query time, resolve the requesting user's group membership (via Confluence or your IdP) and add a Qdrant payload filter groups IN [user_groups]. The filter runs inside Qdrant before scoring, so no restricted results are ever returned. Rebuild this permission payload whenever Confluence restriction changes arrive via webhook.

Option C: Per-request Confluence API check (most accurate, slowest)
After Qdrant returns candidates, call the Confluence REST API to verify the requesting user can read each candidate page, and filter out the ones they cannot. Accurate because it uses Confluence's own permission system as the source of truth, but adds a round-trip per result set and becomes a bottleneck at high query volume.

For most enterprise deployments, Option B is the right balance.
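
The logic of the Option B filter can be sketched in plain Python. In production the same condition runs inside the vector store as a payload filter; this in-memory version only illustrates the rule. It assumes every chunk carries a non-empty allowed_groups list (unrestricted pages would be stored with a catch-all group such as all-staff), and the names filter_by_groups and allowed_groups are hypothetical.

```python
def filter_by_groups(chunks, user_groups):
    """Keep only chunks readable by at least one of the user's groups."""
    user_groups = set(user_groups)
    return [c for c in chunks if user_groups & set(c["allowed_groups"])]


chunks = [
    {"chunk_id": "hr_salaries_c0", "allowed_groups": ["hr"]},
    {"chunk_id": "deploy_guide_c0", "allowed_groups": ["engineering", "all-staff"]},
]
visible = filter_by_groups(chunks, ["all-staff", "engineering"])
print([c["chunk_id"] for c in visible])
```

An engineer sees the deployment guide but not the HR page; a user with no matching group sees nothing.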

Sync strategy

Scenario	Approach
Initial load	Bulk-fetch all spaces in parallel workers; embed in batches of 100; upsert to Qdrant
Ongoing updates	Confluence webhooks → queue → worker upserts only changed chunks
No webhook access	Scheduled polling every 15 min using `lastModified` query parameter; compare against stored `last_modified` in Qdrant payload; skip unchanged pages
Page deleted	Webhook `page_deleted` event → delete all Qdrant points where `page_id == deleted_id`
Space deleted	Delete all Qdrant points where `space_key == deleted_key`
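
The worker's handling of change events can be sketched against an in-memory dict standing in for the vector store. The event shape and the function name apply_event are assumptions; the event type names follow the webhook events listed in the architecture diagram.

```python
def apply_event(index, event):
    """Apply one Confluence change event to an index of chunk_id -> payload.

    Existing chunks of the page are removed first, so a page that shrank
    does not leave stale chunks behind after an update.
    """
    page_id = event["page_id"]
    stale = [cid for cid, payload in index.items() if payload["page_id"] == page_id]
    for chunk_id in stale:
        del index[chunk_id]
    if event["type"] in ("page_created", "page_updated"):
        for chunk in event["chunks"]:
            index[chunk["chunk_id"]] = chunk
    return index


index = {}
apply_event(index, {"type": "page_created", "page_id": "42", "chunks": [
    {"chunk_id": "42_c0", "page_id": "42", "text": "intro"},
    {"chunk_id": "42_c1", "page_id": "42", "text": "details"},
]})
apply_event(index, {"type": "page_updated", "page_id": "42", "chunks": [
    {"chunk_id": "42_c0", "page_id": "42", "text": "shorter page"},
]})
print(sorted(index))
apply_event(index, {"type": "page_deleted", "page_id": "42"})
print(sorted(index))
```

After the update only one chunk remains for page 42, and after the delete event the page is gone from the index.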

AWS cost estimates

Prices below are approximate, based on us-east-1 on-demand rates as of mid-2026. Verify current prices with the AWS Pricing Calculator before budgeting. All figures are per month.

Tier assumptions

	Small	Medium	Large
Confluence pages indexed	10 000	100 000	500 000
Chunks in index	~50 000	~500 000	~2 500 000
Active users	50	500	2 000
Queries per day	100	1 000	5 000
Queries per month	3 000	30 000	150 000

Monthly cost breakdown — OpenSearch Service path

Component	Small	Medium	Large
Amazon OpenSearch Service
Instance (t3.small × 2 nodes HA)	$52	—	—
Instance (m6g.large × 3 nodes HA)	—	$320	—
Instance (m6g.2xlarge × 3 nodes HA)	—	—	$1 280
EBS storage (gp3)	$3	$27	$135
ECS Fargate — MCP server
0.5 vCPU / 1 GB (1 instance)	$18	—	—
0.5 vCPU / 1 GB (2 instances)	—	$36	—
0.5 vCPU / 1 GB (4 instances)	—	—	$72
ECS Fargate — ingestion workers	$5	$15	$40
ElastiCache Redis
cache.t3.micro	$12	—	—
cache.t3.small × 2 (multi-AZ)	—	$49	—
cache.r6g.large × 2 (multi-AZ)	—	—	$240
ALB + ACM certificate	$20	$25	$35
AWS WAF	—	$10	$20
SQS + Secrets Manager + CloudWatch	$8	$12	$20
OpenAI text-embedding-3-small	<$1	$1	$5
Cohere rerank-v4.0-fast	$6	$60	$300
Anthropic Claude claude-sonnet-4-6	$78	$780	$3 900
Estimated total	~$203	~$1 335	~$6 047

Anthropic cost breakdown per query

Each user question triggers an agentic loop with 2–3 Claude API calls (list_spaces → search → optionally get_page). A realistic average:

Token type	Tokens per query	Cost (Sonnet 4.6)
Input (system prompt + tool results)	~5 500	$0.0165
Output (tool selections + final answer)	~600	$0.0090
Total per query		~$0.026

At 1 000 queries/day × 30 days = ~$780/month in Claude API costs alone. This is the dominant cost at every scale — infrastructure is secondary.
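
The per-query figure can be reproduced with a short calculation, which makes it easy to plug in your own token counts. The $3 and $15 per million token rates are the ones implied by the table above; the function name cost_per_query is hypothetical.

```python
def cost_per_query(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """LLM cost of one question in USD, given per-million-token prices."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


per_query = cost_per_query(5_500, 600, usd_per_mtok_in=3.00, usd_per_mtok_out=15.00)
monthly = per_query * 1_000 * 30  # 1 000 queries/day for 30 days
print(f"${per_query:.4f} per query, ${monthly:.0f} per month")
```

With these inputs the unrounded result is about $0.0255 per query and about $765 per month; the table rounds the per-query cost up to $0.026, which yields the ~$780 figure.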

Key observations

The LLM API bill dominates. At medium scale Claude accounts for ~58% of total spend; at large scale ~64%. Optimise here before touching infrastructure.
Infrastructure costs are reasonable. Even at large scale (3 OpenSearch nodes, 4 MCP server tasks, Redis cluster) the AWS bill excluding API costs is ~$1 800/month, roughly one mid-level engineer's monthly salary. The ROI calculation is almost always positive.
OpenSearch storage is cheap. 2.5 M chunks × 300 tokens × ~4 bytes ≈ 3 GB of raw text. With OpenSearch overhead (inverted index + k-NN graph) plan for ~10× = ~30 GB = ~$4/month. Storage is never the problem.

Cost optimisation levers

These are ordered by impact:

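1. Anthropic prompt caching: saves 20–30% on Claude costs
Enable cache_control on the system prompt and the tool definitions block. Cached input tokens are billed at $0.30/MTok (90% discount vs $3/MTok), while cache writes cost slightly more than uncached input. The system prompt (~500 tokens) and tool list (~300 tokens) are identical on every call, but Anthropic enforces a minimum cacheable prefix length (1 024 tokens or more, depending on the model), so on their own they may fall below the threshold. Place the cache breakpoint after enough stable content, such as the earlier turns of the agentic loop, for the prefix to qualify.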

2. Route simple queries to Claude Haiku: saves roughly 40–45% on Claude costs
This estimate assumes Haiku at $0.80/MTok in and $4/MTok out, roughly 4× cheaper than Sonnet; check Anthropic's current price list for the Haiku version you deploy, since rates differ between versions. Add a classifier that sends single-fact lookups ("What is a ConfigMap?") to Haiku and only escalates multi-hop questions ("What changed in our deployment process and why?") to Sonnet. If 60% of queries qualify, medium-scale Claude spend drops from ~$780 to ~$440/month (40% of traffic stays at the Sonnet rate, 60% is billed at 0.8/3 of it).

3. Redis query caching: saves 20–40% on Claude costs for repeated questions
Teams ask the same questions. "How do I request VPN access?" is asked by every new joiner. A Redis cache keyed on a normalised query hash with a 15-minute TTL (already in the architecture diagram) eliminates the Claude round-trip entirely for cache hits. Common internal Q&A workloads see 25–35% cache hit rates.

4. Reserved instances for OpenSearch: saves 30–40% on compute. A 1-year Reserved Instance for m6g.large.search drops from $0.148/hr to ~$0.088/hr. On three nodes that saves ~$130/month (from ~$320 to ~$190 per 3-node cluster). Commit only after validating instance size in production.

5. Fargate Spot for ingestion workers: saves ~70% on worker compute. Ingestion workers are interruptible: if a Spot interruption occurs, SQS re-delivers the message and the worker retries. Set FARGATE_SPOT in the capacity provider strategy of the ECS service that runs the workers. At medium scale this saves ~$10/month; at large scale ~$28/month.

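6. Cohere free tier for small teams. The Cohere trial tier gives 1 000 free rerank calls/month. A team of 50 users making 3 searches/day averages ~4 500 calls/month, about 4.5× the free limit, so at that rate the free tier realistically covers around 10 users. Reducing top_k from 50 to 25 candidates sent to the reranker lowers latency at a small quality cost but does not reduce call volume, assuming rerank usage is counted per search request rather than per document.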
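The query cache from lever 3 can be sketched as follows. This is a minimal illustration, not the project's implementation: the key prefix, the normalisation rules, and the `TTLCache` class are all assumptions, and the class is an in-memory stand-in for Redis so the example runs offline. In a real deployment its `get` and `set` would map onto Redis reads and writes with an expiry.

```python
import hashlib
import re
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(query: str, prefix: str = "rag:answer:") -> str:
    """Normalise a query and hash it so trivially different phrasings collide."""
    normalised = re.sub(r"[^\w\s]", "", query.lower())  # drop punctuation
    normalised = " ".join(normalised.split())            # collapse whitespace
    return prefix + hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with a fixed TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:  # expired: behave like a miss
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def answer(query: str, cache: TTLCache, run_agent: Callable[[str], str]) -> str:
    """Return a cached answer when available, otherwise call the agent once."""
    key = cache_key(query)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = run_agent(query)
    cache.set(key, result)
    return result
```

If answers depend on the caller's Confluence permissions, the key would likely also need a permission scope so one user's cached answer is not served to another.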

Realistic optimised costs

Applying caching, Haiku routing (60% of queries), and reserved instances:

	Small	Medium	Large
Baseline estimate	~$203	~$1 335	~$6 047
After optimisations	~$110	~$620	~$2 900
Saving	~46%	~54%	~52%
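To see how the levers compound, the sketch below combines query caching and model routing into one multiplier on the Claude bill. It is an approximation under stated assumptions: cache hits cost nothing, routed queries cost a fixed fraction of the default model's price, and the function and parameter names are hypothetical. It is not expected to reproduce the table exactly, since the table also includes infrastructure savings.

```python
def optimised_claude_cost(baseline_usd: float,
                          routed_share: float = 0.60,
                          cheap_model_price_ratio: float = 0.25,
                          cache_hit_rate: float = 0.0) -> float:
    """Monthly Claude spend after query caching and model routing.

    Cache hits skip the model entirely; the remaining queries are split
    between the cheaper model and the default model.
    """
    if not (0 <= routed_share <= 1 and 0 <= cache_hit_rate <= 1):
        raise ValueError("shares must be between 0 and 1")
    after_cache = baseline_usd * (1 - cache_hit_rate)
    blended = (1 - routed_share) + routed_share * cheap_model_price_ratio
    return after_cache * blended


if __name__ == "__main__":
    for hit_rate in (0.0, 0.25, 0.35):
        cost = optimised_claude_cost(780, cache_hit_rate=hit_rate)
        print(f"cache hit rate {hit_rate:.0%}: ~${cost:,.0f}/month")
```

The two levers multiply rather than add, which is why stacking them gives a smaller combined saving than summing their individual percentages would suggest.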

Amazon Bedrock Knowledge Bases — cost comparison

The fully managed path trades control for simplicity but is not always cheaper:

Component	Monthly cost
Bedrock Titan Embeddings ($0.0001/1K tokens)	~$15 (initial, medium scale: 500 000 chunks × 300 tokens), <$1 ongoing
OpenSearch Serverless (minimum 4 OCUs)	~$700
Bedrock `Retrieve` API calls	~$0.10 per 1 000 calls
Anthropic Claude (same as above)	same

The OpenSearch Serverless minimum of 4 OCUs (~$700/month) makes Bedrock Knowledge Bases more expensive than self-managed OpenSearch at small and medium scale. It becomes cost-competitive only above ~2 M chunks where you need multiple OpenSearch data nodes anyway. The main argument for Bedrock Knowledge Bases is not cost — it is engineering time saved on the ingestion pipeline.

Disclaimer: All figures are estimates based on public AWS and API pricing as of mid-2026. Actual costs depend on your specific usage patterns, AWS region, negotiated enterprise pricing, and data transfer costs. Use the AWS Pricing Calculator for precise projections before committing to an architecture.

Beyond vector databases — better retrieval approaches

Swapping one vector database for another improves scale and operational robustness but does almost nothing for retrieval quality. The two approaches below address the actual quality bottlenecks for a Confluence-sized corpus.

Approach 1 — Contextual Retrieval

What it is: Before indexing each chunk, ask Claude to prepend a short context paragraph describing where the chunk sits in the document, what the page is about, and why this section matters. Then index with BM25 only — no dense embeddings at all — and apply the existing Cohere reranker on top.

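Anthropic published benchmarks in September 2024 showing that contextual retrieval cuts top-20 retrieval failures by 49% when contextual embeddings and contextual BM25 are combined, and by 67% with a reranker on top. Those results kept dense embeddings in the pipeline, so treat the BM25-only variant described here as a simplification of the published setup and confirm on your own queries that BM25 with prepended context still outperforms BM25 + dense without it.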

Why Confluence chunks need this: When you strip HTML from a Confluence page and split it into 1 500-character chunks, the chunks lose their surrounding context. A chunk that says "set the flag to true to enable this feature" scores well for the query "how do I enable features" but is useless without knowing which page it came from and which feature it refers to. The prepended context sentence — "This chunk is from the Engineering Handbook, section on Feature Flags, describing how to enable a new flag in the production config service" — makes the chunk self-contained and dramatically more retrievable.
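The context-prepending step can be sketched as below. This is an illustration under assumptions: the prompt wording, the function name, and the `generate` callable (a stand-in for whatever wraps the Claude call) are hypothetical rather than taken from the project code.

```python
from typing import Callable

# Hypothetical prompt; tune the wording for your own corpus.
CONTEXT_PROMPT = (
    "Page title: {title}\n"
    "Breadcrumb: {breadcrumb}\n\n"
    "Full page text:\n{page_text}\n\n"
    "Chunk:\n{chunk}\n\n"
    "Write one or two sentences that situate this chunk within the page, "
    "so it can be understood on its own. Reply with the context only."
)


def contextualise_chunk(title: str, breadcrumb: str, page_text: str, chunk: str,
                        generate: Callable[[str], str]) -> str:
    """Prepend LLM-written context to a chunk before it is BM25-indexed."""
    prompt = CONTEXT_PROMPT.format(
        title=title, breadcrumb=breadcrumb, page_text=page_text, chunk=chunk
    )
    context = generate(prompt).strip()
    # Keep the original chunk verbatim; the context only adds searchable terms.
    return f"{context}\n\n{chunk}" if context else chunk


if __name__ == "__main__":
    def fake_generate(prompt: str) -> str:
        # Stand-in for the model call so the sketch runs offline.
        return ("This chunk is from the Engineering Handbook, section on "
                "Feature Flags.")

    print(contextualise_chunk(
        "Feature Flags", "Engineering Handbook > Feature Flags",
        "(full page text)", "Set the flag to true to enable this feature.",
        fake_generate,
    ))
```

Passing the model call in as a callable keeps the indexing code testable without network access and leaves the choice of model to the caller.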

What changes in the pipeline:

Step	Without contextual retrieval	With contextual retrieval
Indexing	chunk → BM25 + embed → store	chunk → Claude context → contextual chunk → BM25 → store
Dense embeddings	Required	Removed entirely
Vector database	Required	Not needed
BM25 index	Required	Required (same)
Reranker	Required	Required (same)
MCP server	Unchanged	Unchanged
Agent	Unchanged	Unchanged

One-time indexing cost (Claude Haiku at $0.80/MTok):

Corpus size	Chunks	Context tokens	Indexing cost
10 000 pages	50 000	~10 M tokens	~$8
100 000 pages	500 000	~100 M tokens	~$80
500 000 pages	2 500 000	~500 M tokens	~$400

This is a one-time cost per full re-index, not a recurring monthly expense. Delta syncs (only changed pages) are proportionally cheaper.
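The table follows from simple arithmetic. The sketch below assumes the ratios implied by the table itself (5 chunks per page and ~200 context tokens per chunk) and the stated $0.80/MTok price; the function name is hypothetical, and the figure covers only the tokens counted in the table.

```python
def indexing_cost_usd(pages: int,
                      chunks_per_page: int = 5,
                      context_tokens_per_chunk: int = 200,
                      usd_per_mtok: float = 0.80,
                      changed_fraction: float = 1.0) -> float:
    """Context-generation cost for a full re-index or a delta sync."""
    chunks = pages * chunks_per_page * changed_fraction
    tokens = chunks * context_tokens_per_chunk
    return tokens / 1_000_000 * usd_per_mtok


if __name__ == "__main__":
    for pages in (10_000, 100_000, 500_000):
        print(f"{pages:>7} pages: ~${indexing_cost_usd(pages):,.0f}")
```

Setting `changed_fraction` to the share of pages edited since the last sync gives the proportionally smaller delta-sync cost.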

Running cost impact: Removing dense embeddings eliminates the OpenAI embedding API call on every query (~$0.02/1M tokens, small but real), removes the vector index from OpenSearch or Qdrant (reducing storage by 30–50%), and simplifies the retrieval code to a single BM25 query path.

AWS implementation: Keep OpenSearch Service for BM25. Remove the k-NN plugin configuration and dense vector field entirely. Add a pre-indexing ECS task that calls the Anthropic API to generate context for each chunk before the ingestion worker writes to OpenSearch.
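A BM25-only index and query might look like the sketch below, expressed as plain request bodies. The field names, the title boost, and the shard settings are assumptions for illustration; the point is that the mapping contains only text and keyword fields, with no dense vector field and no k-NN setting. OpenSearch scores text fields with BM25 by default.

```python
from typing import Optional


def bm25_index_body() -> dict:
    """Index body with text fields only: no vector field, no k-NN setting."""
    return {
        "settings": {"index": {"number_of_shards": 1, "number_of_replicas": 1}},
        "mappings": {
            "properties": {
                "page_id": {"type": "keyword"},
                "space_key": {"type": "keyword"},
                "title": {"type": "text"},
                # Generated context plus the original chunk: this is what BM25 searches.
                "contextual_text": {"type": "text"},
                # Original chunk, stored for display but not searched.
                "chunk_text": {"type": "text", "index": False},
            }
        },
    }


def bm25_query_body(query: str, top_k: int = 50,
                    space_key: Optional[str] = None) -> dict:
    """Keyword query that returns top_k candidates for the reranker."""
    must = [{
        "multi_match": {
            "query": query,
            "fields": ["title^2", "contextual_text"],
        }
    }]
    filters = [{"term": {"space_key": space_key}}] if space_key else []
    return {"size": top_k, "query": {"bool": {"must": must, "filter": filters}}}
```

With the opensearch-py client these bodies would typically be passed to `indices.create` and `search`; the reranker then runs over the returned hits exactly as before.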

Approach 2 — GraphRAG

What it is: Instead of treating Confluence as a flat collection of text chunks, build a knowledge graph from it — extracting entities, relationships, and summaries — and answer questions by traversing the graph rather than scoring chunks by similarity.

Microsoft Research published GraphRAG in 2024 and showed 20–70% improvement over naive RAG on complex multi-hop questions depending on question type. The gains are largest on exactly the investigative questions that Confluence is used for.

Why Confluence is a natural graph: Confluence already has rich structure that flat retrieval throws away:

Space
  └── Section page
        ├── Child page A  ──links to──► ADR-042
        │     └── Child page A1         │
        └── Child page B                └──triggered by──► Incident-2024-Q3
              (author: team-platform)                        (owner: team-sre)

When a user asks "What changed in our deployment process last quarter and why?", the answer requires:

1. Find the current deployment process page
2. Follow links to the ADR that modified it
3. Find the incident report that triggered the ADR
4. Synthesise the chain of causality across three pages

A flat vector search returns the chunks with the highest similarity score. A graph traversal follows the actual structure of the knowledge.
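The difference is easiest to see on a toy graph. The sketch below is an in-memory illustration only: the `PageGraph` class and the relation names are hypothetical, and the node names are borrowed from the diagram above. It follows one typed edge per step, which is the causal chain the deployment question needs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PageGraph:
    """Tiny directed graph with typed edges (illustrative, not Neptune)."""

    def __init__(self) -> None:
        self._out: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self._out[source].append((relation, target))

    def follow(self, start: str, relations: List[str]) -> List[List[str]]:
        """Follow one edge per relation, in order; return every matching path."""
        paths = [[start]]
        for relation in relations:
            next_paths = []
            for path in paths:
                for rel, target in self._out.get(path[-1], []):
                    if rel == relation:
                        next_paths.append(path + [target])
            paths = next_paths
        return paths


if __name__ == "__main__":
    graph = PageGraph()
    graph.add_edge("Child page A", "links_to", "ADR-042")
    graph.add_edge("ADR-042", "triggered_by", "Incident-2024-Q3")
    graph.add_edge("Child page A", "links_to", "Unrelated page")
    print(graph.follow("Child page A", ["links_to", "triggered_by"]))
```

The unrelated link is dropped because it has no `triggered_by` edge, so only the page, ADR, and incident chain survives. A similarity search has no equivalent way to express that constraint.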

How GraphRAG works for Confluence:

Ingestion
  │
  ├─► Extract entities from each page
  │     (system names, team names, process names, ticket IDs, dates)
  │
  ├─► Extract relationships between entities
  │     (page A links to page B, process X was changed by ADR Y,
  │      incident Z triggered decision W)
  │
  ├─► Build community summaries
  │     (cluster related pages into topics, summarise each cluster)
  │
  └─► Store in graph database (Neptune) + keep BM25 for keyword search

Query
  │
  ├─► Global questions ("what are our main deployment processes?")
  │     → community summary traversal, no chunk retrieval needed
  │
  └─► Local questions ("how do I deploy service X?")
        → entity lookup → graph hop → retrieve relevant pages → answer
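The structural part of the ingestion step needs no LLM call, because parent pages, links, and authors are already in the page metadata. The sketch below assumes a hypothetical page dictionary (`id`, `parent_id`, `author`, `links`, `body`) and hypothetical relation names. The ticket-ID regex is a rough heuristic that will produce false positives such as "UTF-8", and the entity extraction that the diagram assigns to Claude is not shown.

```python
import re
from dataclasses import dataclass, field
from typing import Set, Tuple

Node = Tuple[str, str]                     # (label, identifier)
Edge = Tuple[Node, str, Node]              # (source, relation, target)

TICKET_ID = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # e.g. ADR-042, OPS-17


@dataclass
class GraphFragment:
    nodes: Set[Node] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)

    def link(self, source: Node, relation: str, target: Node) -> None:
        self.nodes.update((source, target))
        self.edges.add((source, relation, target))


def page_to_fragment(page: dict) -> GraphFragment:
    """Turn one page's metadata and body into nodes and edges."""
    frag = GraphFragment()
    page_node: Node = ("Page", page["id"])
    frag.nodes.add(page_node)
    if page.get("parent_id"):
        frag.link(page_node, "CHILD_OF", ("Page", page["parent_id"]))
    if page.get("author"):
        frag.link(page_node, "AUTHORED_BY", ("Team", page["author"]))
    for linked_id in page.get("links", []):
        frag.link(page_node, "LINKS_TO", ("Page", linked_id))
    for ticket in set(TICKET_ID.findall(page.get("body", ""))):
        frag.link(page_node, "MENTIONS", ("Ticket", ticket))
    return frag
```

Because nodes and edges are sets, re-ingesting an unchanged page yields the same fragment, which keeps the graph upsert idempotent.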

Two query modes:

Mode	Best for	How it works
Local search	Specific factual questions	Find entity in graph → traverse 1–2 hops → retrieve source pages → answer
Global search	Broad thematic questions	Query community summaries → synthesise across the whole corpus

AWS implementation: Amazon Neptune (fully managed graph database) for the knowledge graph. Neptune Analytics (in-memory graph engine, announced 2023) for fast traversal queries. The ingestion worker gains a graph extraction step that calls Claude to identify entities and relationships per page, then writes edges to Neptune alongside the existing OpenSearch BM25 upsert.

AWS architecture addition for GraphRAG:

Ingestion worker
  │
  ├─► (existing) chunk → BM25 upsert → OpenSearch
  │
  └─► (new) page full text → Claude entity extraction
                              → Neptune upsert (nodes + edges)
                              → Neptune Analytics community clustering (nightly)

Query (MCP server)
  │
  ├─► hybrid_search (existing BM25 + rerank path)
  │
  └─► graph_search (new tool)
        → Neptune Analytics traversal
        → fetch source pages
        → synthesise answer

The existing hybrid_search, get_page_full, and list_spaces MCP tools remain unchanged. graph_search is an additional fourth tool the agent can call for questions that require following relationships across pages.
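As an illustration of what graph_search might send to the graph, the sketch below builds a parameterised openCypher query for the page, decision, and incident chain. The node labels, relationship types, and property names are hypothetical and depend entirely on how the ingestion step models the graph; submitting the query to Neptune is left out.

```python
def build_causality_query(page_title: str, max_results: int = 10) -> dict:
    """Build an openCypher query that traces page -> decision -> incident."""
    limit = int(max_results)
    if limit < 1:
        raise ValueError("max_results must be a positive integer")
    query = (
        "MATCH (p:Page {title: $title})-[:LINKS_TO]->(d:Decision)"
        "-[:TRIGGERED_BY]->(i:Incident) "
        "RETURN p.title AS page, d.title AS decision, i.title AS incident "
        f"LIMIT {limit}"
    )
    # The title is passed as a parameter rather than interpolated into the query.
    return {"query": query, "parameters": {"title": page_title}}


if __name__ == "__main__":
    request = build_causality_query("Deployment process")
    print(request["query"])
    print(request["parameters"])
```

Keeping user-supplied text in the parameters map, rather than in the query string, avoids query injection through page titles typed into a chat prompt.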

Cost addition (medium scale, 100k pages):

Component	Monthly cost
Amazon Neptune (db.r6g.large)	~$200
Neptune Analytics (2 NCUs)	~$180
Claude Haiku entity extraction (initial, one-time)	~$50
Claude Haiku entity extraction (monthly delta, 5% change)	~$3
Total addition	~$383/month

Comparison across all approaches

Approach	Retrieval quality	Ops complexity	Monthly cost delta	Best for
Prototype (BM25+dense+RRF+rerank)	Good	Low	baseline	Getting started
Vector DB upgrade (Qdrant/OpenSearch kNN)	Good	Medium	+$0–200	Scale, not quality
Contextual Retrieval + BM25	Very good	Low	−$50–200 (saves embedding cost)	Best quality-to-effort ratio
GraphRAG	Excellent on multi-hop	High	+$350–600	Complex investigative questions
Contextual + GraphRAG	Excellent across all types	Very high	+$250–400 net	Full enterprise production

Recommended implementation order

Phase 1: Contextual Retrieval (week 1–2). Add a context-generation step before BM25 indexing. Remove dense embeddings and the vector index. This is the highest-ROI change: better retrieval quality, lower running cost, less infrastructure. The MCP server, agent, and chatbot are untouched.

Phase 2: Vector DB for scale (month 2–3, if corpus > 500k chunks). If the corpus grows large enough that OpenSearch BM25 performance degrades, add a vector index (OpenSearch kNN or Qdrant) back in alongside the contextual BM25. At this scale the quality gain from contextual retrieval still applies on top.

Phase 3: GraphRAG (month 4–6, if multi-hop questions dominate). Instrument user queries for 4–6 weeks after Phase 1. If a significant share of questions require tracing relationships across pages (history of a decision, ownership chains, impact of an incident), add the Neptune graph layer and the graph_search MCP tool. This is a non-trivial engineering investment; only do it when the query analysis confirms it is the right bottleneck to fix.
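A first pass at the query analysis for Phase 3 can be as simple as the sketch below, which estimates what share of logged queries look multi-hop. The keyword patterns are an assumption, chosen from the question types this README associates with GraphRAG. They are a crude heuristic, so a sample should be reviewed by hand before committing to the graph layer.

```python
import re
from typing import List

# Hypothetical signals for relationship-tracing questions.
MULTI_HOP_PATTERNS = [
    r"\bwhy\b", r"\bwhat changed\b", r"\bwho owns\b", r"\bhistory of\b",
    r"\bimpact of\b", r"\bled to\b", r"\btriggered\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in MULTI_HOP_PATTERNS]


def looks_multi_hop(query: str) -> bool:
    """True if the query matches any relationship-tracing pattern."""
    return any(pattern.search(query) for pattern in _COMPILED)


def multi_hop_share(queries: List[str]) -> float:
    """Fraction of logged queries that look multi-hop (0.0 for an empty log)."""
    if not queries:
        return 0.0
    return sum(looks_multi_hop(q) for q in queries) / len(queries)


if __name__ == "__main__":
    log = [
        "How do I deploy service X?",
        "What changed in our deployment process and why?",
        "Who owns the billing service?",
        "What is a ConfigMap?",
    ]
    print(f"multi-hop share: {multi_hop_share(log):.0%}")
```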

Key principle: The vector database is a storage and scale concern. Contextual Retrieval and GraphRAG are quality concerns. Fix quality first, then scale the storage layer to match the corpus size.
