Confluence Hybrid RAG MCP Server

Enables hybrid search and agentic retrieval over Confluence pages via MCP tools, allowing Claude Desktop, Cursor, or Claude Code to search and retrieve Confluence content.

Confluence Hybrid RAG + Agentic RAG + MCP + PROD Design

This project implements hybrid, agentic RAG over Confluence using Python, bm25s, OpenAI embeddings, RRF, the Cohere reranker, Pydantic-AI, and FastMCP. It ships an MCP server and a Streamlit chatbot, so Confluence can be searched from Claude Desktop, Cursor, or Claude Code.



Quick reference — choose your retrieval approach

Pick your approach based on corpus size and the type of questions your users ask. Details, cost estimates, and AWS implementation for each option are in the Production architecture section below.

By corpus size

| Confluence pages | Recommended approach | Storage needed |
|---|---|---|
| < 50 000 | Contextual BM25 + reranker | OpenSearch only |
| 50 000 – 300 000 | Contextual BM25 + reranker | OpenSearch only |
| 300 000 – 1 000 000 | Contextual hybrid (BM25 + dense) + reranker | OpenSearch + vector index |
| > 1 000 000 | Contextual hybrid + reranker + GraphRAG | OpenSearch + vector index + Neptune |

By question type

| Questions your users ask | Best approach |
|---|---|
| "How do I do X?" (specific how-to) | Contextual BM25 + reranker |
| "What is X?" (definitions, policies) | Contextual BM25 + reranker |
| "Find anything about X" (paraphrased, exploratory) | Contextual hybrid (BM25 + dense) |
| "What changed in X and why?" (history, causality) | GraphRAG |
| "Who owns X and what is its current status?" (ownership chains) | GraphRAG |
| All of the above | Contextual hybrid + GraphRAG |

All options at a glance

| Approach | Quality | Ops complexity | AWS cost / month (medium scale) | Removes vector DB? |
|---|---|---|---|---|
| Prototype: BM25 + dense + RRF + rerank | Good | Low | ~$1 335 | No |
| Vector DB upgrade: Qdrant / OpenSearch kNN | Good | Medium | ~$1 400 | No |
| Contextual BM25: Claude context prefix + BM25 + rerank | Very good | Low | ~$1 000 | Yes |
| Contextual hybrid: context prefix + BM25 + dense + rerank | Excellent | Medium | ~$1 200 | No |
| GraphRAG: knowledge graph traversal | Excellent on multi-hop | High | ~$1 720 | Optional |
| Contextual hybrid + GraphRAG | Best across all query types | Very high | ~$1 900 | No |

Recommended implementation order

  1. Start with Contextual BM25 — add a Claude context-generation step before indexing, remove dense embeddings. Better quality, lower cost, less infrastructure. No changes to the MCP server, agent, or chatbot.

  2. Add dense embeddings back when you observe users rephrasing the same question multiple ways and getting inconsistent results — the signal that BM25 recall is the bottleneck.

  3. Add GraphRAG only after analysing real user queries for 4–6 weeks and confirming that multi-hop questions (history of a decision, ownership chains, impact of an incident) represent a significant share of traffic.

Key principle: the vector database is a scale concern, not a quality concern. Contextual Retrieval and GraphRAG address the actual quality bottlenecks. Fix quality first, then scale storage to match corpus size.


Why combine the two patterns?

What hybrid retrieval solves

Your company wiki uses different words than your users do. A user asks "How do we handle auth failures?", but the Confluence page is titled "Authentication error handling". BM25 alone misses it because the query and the title share almost no terms. Dense search alone has the opposite weakness: it handles the paraphrase but dilutes rare product names and ticket IDs. The BM25 + dense + RRF + rerank pipeline from knowledge/hybrid-retrieval covers both cases.

What agentic retrieval solves

Some questions need more than one search: "What changed in the deployment process last quarter and why?" requires reading the current process page, finding the ADR that changed it, and cross-referencing an incident report. A single-shot retrieval can't do that. The agentic loop from knowledge/agentic-rag lets the model search multiple times, follow leads, and synthesise across pages.

What MCP adds

MCP makes the whole thing a first-class tool inside any MCP client. Claude in your IDE or chat interface can search Confluence on its own when it needs internal context — without you having to copy-paste docs into the prompt manually.

Architecture

Confluence REST API
       │
       ▼
1-fetch-confluence.py   pull pages → strip HTML → chunk → chunks/*.json
       │
       ▼
2-build-index.py        BM25 index (bm25s)  +  dense embeddings (OpenAI)
                                │                        │
                                └────────────┬───────────┘
                                             ▼
                               indexes/bm25/  indexes/embeddings.npy
                               indexes/meta.json
                                             │
                        ┌────────────────────┼───────────────────────┐
                        ▼                    ▼                       ▼
               3-hybrid-search.py      4-agent.py           5-mcp-server.py
               (interactive CLI)    (pydantic-ai agent)   (FastMCP → Claude)

Retrieval pipeline (inside every search call)

Query
  │
  ├─► BM25           catches exact terms, product names, ticket IDs
  │                  e.g. "JIRA-4521", "prod-db-01", "rerank-v4.0-fast"
  │
  ├─► Dense          catches paraphrase and synonyms
  │   (cosine)       e.g. "auth failure" ↔ "authentication error"
  │
  ├─► RRF            fuses the two ranked lists without score normalisation
  │                  (RRF sidesteps the problem that BM25 and cosine scores
  │                   live on completely different scales)
  │
  └─► Cohere rerank  cross-encoder that sees query + document jointly
      top-50 in      much higher precision than bi-encoder dense alone
      → top-10 out
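
The RRF step in the pipeline above can be sketched in a few lines. This is a minimal illustration and not the code in utils/retrieval.py: the function name rrf_fuse, the sample chunk ids, and the constant k=60 (a value commonly used for RRF) are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, so BM25 and cosine
    scores never have to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c7", "c2", "c9"]   # hypothetical BM25 ranking
dense_hits = ["c2", "c4", "c7"]  # hypothetical dense ranking
fused = rrf_fuse([bm25_hits, dense_hits])
```

Chunks that appear in both lists rise to the top of the fused ranking, and the leading candidates of that ranking are what would be passed to the reranker.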

Agentic loop (inside 4-agent.py and every MCP session)

User question
      │
      ▼
 list_spaces          (which Confluence domains are indexed?)
      │
      ▼
 hybrid_search ──────► BM25 + dense + RRF + rerank
      │
      ▼
 snippets relevant?
      ├─ Yes ──► synthesise answer
      └─ No  ──► get_page_full (read a full page)
                 or hybrid_search again with a refined query
                      │
                      ▼
              synthesise answer + citations

MCP integration

The MCP server in 5-mcp-server.py exposes three tools:

| Tool | What it does |
|---|---|
| list_confluence_spaces | Returns indexed spaces (key + name) |
| search_confluence | Four-stage hybrid search; returns snippets |
| get_confluence_page | Returns full text of one page by page_id |

Claude acts as the agent — it calls these tools in a loop the same way 4-agent.py does internally. No duplicate agent layer on the server.

Connect Claude Desktop

{
  "mcpServers": {
    "confluence": {
      "url": "http://localhost:8051/sse"
    }
  }
}

Config file location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Connect Claude Code

Add .mcp.json in your project root:

{
  "mcpServers": {
    "confluence": {
      "type": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Connect Cursor

Settings → MCP → Add server → URL: http://localhost:8051/sse

Setup

# 1. Install dependencies (from the repo root)
uv sync

# 2. Copy and fill in credentials
cp .env.example .env

# 3. Fetch Confluence pages (run with empty CONFLUENCE_SPACE_KEYS first
#    to see available spaces, then set the ones you want and re-run)
uv run 1-fetch-confluence.py

# 4. Build BM25 + dense indexes (~$0.002 per 1 000 chunks)
uv run 2-build-index.py

# 5. Test retrieval interactively
uv run 3-hybrid-search.py

# 6. Try the full agent (optional)
uv run 4-agent.py "What is our deployment process?"

# 7. Start the MCP server for Claude Desktop / Cursor / Claude Code
uv run 5-mcp-server.py

You need credentials for four services in .env:

  • CONFLUENCE_BASE_URL + CONFLUENCE_EMAIL + CONFLUENCE_API_TOKEN
  • OPENAI_API_KEY — embeddings (text-embedding-3-small)
  • COHERE_API_KEY — reranker (rerank-v4.0-fast, free tier)
  • ANTHROPIC_API_KEY — only needed for 4-agent.py

Testing without a Confluence instance

1-fetch-k8s.py is a drop-in replacement for 1-fetch-confluence.py that scrapes the public Kubernetes documentation (kubernetes.io/docs) instead of your Confluence instance. It produces chunks in the exact same JSON format, so every subsequent step (2-build-index.py, 3-hybrid-search.py, 5-mcp-server.py, 6-chatbot.py) runs unchanged. The scraper itself needs no Confluence credentials; the later steps still use the OPENAI_API_KEY and COHERE_API_KEY described in Setup.

Prerequisites

# beautifulsoup4 is the only extra dependency
uv pip install beautifulsoup4

Running the scraper

# Scrape ~190 Kubernetes docs pages (≈ 2 min at 0.5 s/request)
uv run 1-fetch-k8s.py

# Then continue with the normal pipeline
uv run 2-build-index.py
uv run 5-mcp-server.py              # terminal 1
uv run streamlit run 6-chatbot.py   # terminal 2

Expected output:

Fetching sitemap: https://kubernetes.io/en/sitemap.xml
Found 192 pages to index

[  1/192] Concepts                                           3 chunk(s)
[  2/192] Kubernetes Components                              4 chunk(s)
...
Done: 189 pages → 847 chunks  (2 empty, 1 errors)
Chunks saved to: .../chunks/

Configuration

All tunable constants are at the top of 1-fetch-k8s.py:

| Constant | Default | What it controls |
|---|---|---|
| INCLUDE_SECTIONS | concepts/, tasks/, tutorials/, setup/, reference/glossary/, reference/kubectl/ | Sections of kubernetes.io/docs to crawl |
| SKIP_PATTERNS | reference/kubernetes-api/, contribute/ | Sub-paths excluded even if they fall under an included section |
| REQUEST_DELAY | 0.5 s | Pause between HTTP requests (respectful crawl rate) |
| MAX_CHUNK_CHARS | 1500 | Maximum characters per chunk |
| OVERLAP_CHARS | 150 | Overlap between consecutive chunks |

The large API reference (reference/kubernetes-api/, ~600 pages of spec tables) is excluded by default to keep index size manageable. Add it to INCLUDE_SECTIONS if you need API field lookups.
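
To make MAX_CHUNK_CHARS and OVERLAP_CHARS concrete, here is a minimal sketch of character-based chunking with overlap. The function chunk_text is hypothetical; the actual scraper may also respect paragraph or sentence boundaries, which this sketch does not attempt.

```python
def chunk_text(text, max_chars=1500, overlap=150):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters, so a sentence cut at
    a boundary still appears whole in one of the two chunks.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 300  # 3 000 characters of stand-in page text
pieces = chunk_text(sample)
```

With the defaults, each new chunk starts 1 350 characters after the previous one, so the last 150 characters of one chunk are repeated at the start of the next.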

Output format

Each chunk is saved as a JSON file in chunks/ and uses space_key: "K8S". The schema is identical to Confluence chunks, so 3-hybrid-search.py and the MCP server treat them the same way:

{
  "chunk_id": "docs-concepts-workloads-pods_c0",
  "page_id": "docs-concepts-workloads-pods",
  "title": "Pods",
  "space_key": "K8S",
  "space_name": "Kubernetes Docs",
  "url": "https://kubernetes.io/docs/concepts/workloads/pods/",
  "text": "...",
  "chunk_idx": 0,
  "last_modified": "Mon, 10 Jun 2024 12:00:00 GMT"
}

Example queries once running

  • "What is the difference between a Deployment and a StatefulSet?"
  • "How do I configure resource limits for a Pod?"
  • "What happens when a node fails?"
  • "How does the Kubernetes scheduler decide where to place a Pod?"

Files

./
├── 1-fetch-confluence.py   Production: fetch & chunk Confluence pages
├── 1-fetch-k8s.py          Test/demo: scrape public Kubernetes docs
├── 2-build-index.py        Build BM25 + dense indexes
├── 3-hybrid-search.py      Interactive search CLI
├── 4-agent.py              Pydantic-AI agent (standalone)
├── 5-mcp-server.py         FastMCP server for Claude/Cursor/Claude Code
├── 6-chatbot.py            Streamlit chatbot (MCP client + chat UI)
├── pyproject.toml          All dependencies
├── .env.example
└── utils/
    ├── confluence.py        ConfluenceClient + html_to_text + chunk()
    ├── retrieval.py         HybridRetriever (BM25+dense+RRF+rerank)
    └── agent_tools.py       list_spaces / hybrid_search / get_page_full
                             shared by 4-agent.py and 5-mcp-server.py

Keeping the index fresh

Confluence changes. A simple refresh loop:

# Re-fetch changed pages and rebuild indexes (run nightly via cron)
uv run 1-fetch-confluence.py && uv run 2-build-index.py

For incremental updates, add last_modified filtering in iter_pages() to skip pages not changed since the last run (compare against the timestamp stored in indexes/meta.json).
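
A minimal sketch of that filter, assuming each page record carries an ISO-8601 last_modified string and that the previous sync time has already been read from indexes/meta.json. The function name pages_to_refresh and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def pages_to_refresh(pages, last_sync):
    """Return only the pages modified after the previous sync.

    `pages` is an iterable of dicts with an ISO-8601 `last_modified`
    string; `last_sync` is a timezone-aware datetime.
    """
    changed = []
    for page in pages:
        modified = datetime.fromisoformat(page["last_modified"])
        if modified > last_sync:
            changed.append(page)
    return changed


last_sync = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
pages = [
    {"page_id": "1", "last_modified": "2024-06-09T08:00:00+00:00"},
    {"page_id": "2", "last_modified": "2024-06-11T09:30:00+00:00"},
]
stale = pages_to_refresh(pages, last_sync)
```

Only page "2" is newer than the stored timestamp, so only that page would be re-chunked and re-embedded.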

Extending

| Want to add | Where to change |
|---|---|
| Filter by Confluence label/ancestor | ConfluenceClient.iter_pages(): add expand=ancestors,metadata.labels |
| Parent/child page navigation tool | Add get_child_pages(page_id) to utils/agent_tools.py |
| Local reranker (offline) | Swap Cohere in HybridRetriever for BAAI/bge-reranker-v2-m3 |
| Vector DB instead of numpy | Replace np.load in HybridRetriever with Qdrant/LanceDB client |
| Incremental index updates | Track last_modified in meta.json, skip unchanged pages in step 1 |
| Evaluate retrieval quality | Use knowledge/hybrid-retrieval/docs/build-your-own-eval.md as recipe |

Production architecture (many GB of Confluence data)

The prototype works well for hundreds of pages but hits hard limits at scale. This section explains what breaks and how to redesign each layer for production.

Why the prototype does not scale

| Component | Prototype | Breaks when… |
|---|---|---|
| Dense index | embeddings.npy loaded fully into RAM | >500 k chunks (~3 GB at 1536 dims, fp32) |
| BM25 index | bm25s rebuilt from scratch every sync | Corpus grows beyond a few hundred MB of text |
| Sync | Full re-fetch + full re-index | Pages number in the tens of thousands |
| MCP server | Single-process, no auth | Multiple concurrent users, internal tool exposure |
| Access control | None: every user sees every page | Any team with Confluence page restrictions |
| Chunking | Character-based with fixed overlap | Long structured pages (tables, code blocks split badly) |
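
The RAM figure for the dense index follows directly from the matrix shape. A small worked calculation (the helper name dense_index_bytes is hypothetical):

```python
def dense_index_bytes(num_chunks, dims=1536, bytes_per_value=4):
    """RAM needed to hold an fp32 embedding matrix fully in memory."""
    return num_chunks * dims * bytes_per_value


# 500 k chunks of text-embedding-3-small vectors (1536 dims, fp32)
size_gb = dense_index_bytes(500_000) / 1e9
```

This counts only the raw matrix; the Python process, the BM25 index, and the chunk metadata come on top of it.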

Recommended production stack

Vector database

The right choice depends on where you run infrastructure. The short version: Qdrant if you want the best hybrid search engine; Amazon OpenSearch Service if you are already on AWS and want to stay AWS-native; Bedrock Knowledge Bases if you want zero ingestion pipeline work.

| Option | Cloud | Choose if… | Trade-off |
|---|---|---|---|
| Qdrant Cloud | Any (hosted on AWS/GCP/Azure infra) | Starting fresh, want managed ops | Easiest setup; data in Qdrant's account |
| Qdrant on EKS/ECS | AWS | Data must stay in your AWS account; already run containers | You manage upgrades and backups |
| Qdrant self-hosted | On-prem / any VM | Full control, air-gapped environments | Full operational burden |
| Amazon OpenSearch Service | AWS | Already on AWS, want IAM + CloudWatch native integration | Slightly slower ANN than Qdrant at extreme scale |
| Amazon Bedrock Knowledge Bases | AWS | Want zero ingestion pipeline; Confluence connector built-in | Less control over chunking, reranking, and MCP integration |
| Azure AI Search | Azure | Already on Azure | Native Confluence connector; higher cost per query |
| pgvector (RDS/Aurora) | Any | Small corpus (<1 M chunks), already run PostgreSQL | Simplest ops; ANN slows above ~5 M vectors |
| Pinecone | Any | | No native sparse support in standard tier; vendor lock-in |

Qdrant is the strongest general-purpose choice because it offers native hybrid search (dense + sparse vectors in a single query with built-in RRF) without requiring application-level result merging. Amazon OpenSearch Service also supports single-query hybrid search; see Path 2 in the Running on AWS section below, which also covers the other AWS-specific guidance.

Sparse vectors — BM42 instead of BM25

In production, replace the bm25s BM25 index with BM42 sparse vectors stored inside Qdrant. BM42 is a neural sparse model (built into Qdrant's FastEmbed library) that produces sparse vectors compatible with Qdrant's sparse index. It significantly outperforms classical BM25 on paraphrased queries while preserving the exact-term matching that dense embeddings miss. The result is a single Qdrant collection with both a dense vector field and a sparse vector field, queried together in one round-trip.

Embedding model

Keep text-embedding-3-small as the default (best cost/quality ratio for this use case). Switch to text-embedding-3-large only if your Confluence content is heavily multilingual or filled with specialised technical jargon where the extra embedding capacity measurably improves NDCG on your eval set.

Reranker

Keep Cohere rerank-v4.0-fast for managed convenience. Switch to BAAI/bge-reranker-v2-m3 (self-hosted, ~568 M parameters) if your data governance policy prohibits sending document snippets to an external API.

Chunking

Replace the character-based chunking in utils/confluence.py with semantic chunking using Docling. Docling understands Confluence's HTML structure — it splits at heading boundaries, keeps table rows together, and does not cut mid-sentence. Pair this with a parent-child chunking strategy: embed small chunks (~256 tokens) for high-precision retrieval, but return the parent section (~1 024 tokens) as context to the agent. This gives precise matching without truncating the evidence the model needs to answer well.
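
The parent-child idea can be sketched without any chunking library. This is an illustration under stated assumptions, not Docling's API: sizes are counted in whitespace-separated words as a rough stand-in for tokens, and build_parent_child and expand_to_parents are hypothetical helper names.

```python
def build_parent_child(sections, child_size=40):
    """Split each parent section into small child chunks.

    Returns (children, parents). Children are the units to embed and
    search; each child keeps the id of the section it came from.
    """
    parents, children = {}, []
    for parent_id, text in sections.items():
        parents[parent_id] = text
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append({
                "child_id": f"{parent_id}_c{i // child_size}",
                "parent_id": parent_id,
                "text": " ".join(words[i:i + child_size]),
            })
    return children, parents


def expand_to_parents(hit_children, parents):
    """Map retrieved children to unique parent sections, in hit order."""
    seen, context = set(), []
    for child in hit_children:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            context.append(parents[child["parent_id"]])
    return context


sections = {"deploy": "word " * 100, "rollback": "step " * 30}
children, parents = build_parent_child(sections)
# Pretend the retriever returned these three children, best first.
context = expand_to_parents([children[1], children[0], children[3]], parents)
```

Two of the three hits share a parent, so the agent receives two full sections rather than three overlapping fragments.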


Running on AWS

If your company operates on AWS, each layer of the stack maps to a managed AWS service. You have three realistic paths depending on how much control you want over the retrieval pipeline.

Path 1 — Qdrant on AWS (best retrieval quality, your infra)

Run Qdrant inside your own AWS account so data never leaves your perimeter, while keeping Qdrant's native hybrid search.

| Qdrant deployment | When to choose |
|---|---|
| Qdrant Cloud on AWS | Fastest start; Qdrant manages ops; pick the AWS region closest to your app. Data lives in Qdrant's AWS account, so check your data residency policy first. |
| Qdrant on EKS | Already run Kubernetes; use the official Qdrant Helm chart; EBS volumes for persistent storage; IAM for pod-level auth. |
| Qdrant on ECS Fargate | No Kubernetes; run the official Docker image as a Fargate service; EFS mount for persistence; simpler ops than EKS. |

Path 2 — Amazon OpenSearch Service (AWS-native, recommended for most AWS teams)

This is the pragmatic default for AWS. OpenSearch Service is fully managed, stays inside your AWS account, and has supported hybrid search (BM25 + k-NN vector in a single query) since version 2.10. It replaces Qdrant without any change to the MCP server or agent — only utils/retrieval.py changes.

Why choose OpenSearch Service over Qdrant on AWS:

  • IAM authentication — no separate credentials; attach an IAM role to your workers and MCP server, and OpenSearch accepts them natively.
  • VPC isolation — deploy into a private VPC subnet; no public endpoint needed.
  • CloudWatch integration — cluster metrics, slow query logs, and index statistics flow to CloudWatch out of the box.
  • One fewer vendor — no Qdrant Cloud account, no Qdrant billing, no separate support contract. Everything is under your existing AWS bill.
  • Familiar to AWS ops teams — most AWS platform teams already know how to run OpenSearch.

The trade-off: OpenSearch is slower per node than Qdrant at very high query volume (Qdrant is written in Rust; OpenSearch runs on the JVM). For a company-internal Confluence search workload you are unlikely to reach that ceiling.

Path 3 — Amazon Bedrock Knowledge Bases (fully managed, zero pipeline)

If engineering time is the bottleneck, Bedrock Knowledge Bases can eliminate most of the ingestion pipeline. It has a native Confluence data source connector — you supply OAuth credentials and a space list, and Bedrock handles fetching, chunking, embedding (Amazon Titan or third-party models), and indexing into either OpenSearch Serverless or Aurora pgvector.

What you keep: the MCP server and the Streamlit chatbot. What you replace: 1-fetch-confluence.py, 2-build-index.py, utils/retrieval.py, and utils/confluence.py. The search_confluence tool in the MCP server calls the Bedrock Retrieve API instead of Qdrant directly.

Trade-offs:

  • Less control over chunk size, overlap, and chunking strategy.
  • Reranking must go through Bedrock's own reranker; Cohere is not directly pluggable.
  • Bedrock (not your MCP client) handles the retrieve-and-answer loop if you use RetrieveAndGenerate. To keep the MCP pattern, have search_confluence call only the Retrieve API and let the MCP client drive the loop as it does today.
  • Cold-start latency on OpenSearch Serverless can be high for infrequent queries.

AWS services mapping

| Role | AWS service |
|---|---|
| Confluence change events | Confluence webhooks → API Gateway → SQS |
| Ingestion workers | ECS Fargate (auto-scaling) or EKS |
| Vector store (AWS-native) | Amazon OpenSearch Service (hybrid BM25 + k-NN) |
| Vector store (Qdrant path) | Qdrant on EKS or ECS Fargate |
| Fully managed RAG | Amazon Bedrock Knowledge Bases + Confluence connector |
| Query result cache | ElastiCache for Redis (TTL 15 min) |
| MCP server hosting | ECS Fargate behind an ALB |
| TLS termination | ALB (ACM certificate, no Nginx needed) |
| DDoS + WAF | AWS WAF on the ALB |
| API key / secret storage | AWS Secrets Manager (rotate without redeploy) |
| Container images | ECR (Elastic Container Registry) |
| Monitoring + alerts | CloudWatch metrics, alarms, and dashboards |
| Embedding API | OpenAI (via internet) or Amazon Bedrock hosted models |
| Reranker API | Cohere (via internet) or Bedrock reranker |

AWS architecture diagram

Confluence Cloud
      │
      │  Webhooks (page_created / updated / deleted)
      ▼
┌──────────────┐    ┌─────────────────────────────────────────┐
│ API Gateway  │───►│  Amazon SQS                              │
│ (webhook     │    │  — buffers change events, decouples      │
│  endpoint)   │    │    Confluence rate limits from workers   │
└──────────────┘    └──────────────────┬──────────────────────┘
                                       │
                                       ▼
                        ┌─────────────────────────────┐
                        │  ECS Fargate — Ingestion     │
                        │  Workers (auto-scaling)      │
                        │                              │
                        │  1. Fetch page (Confluence   │
                        │     REST API)                │
                        │  2. Chunk (Docling)          │
                        │  3. Embed (OpenAI / Bedrock) │
                        │  4. Upsert to vector store   │
                        └──────────────┬──────────────┘
                                       │
                    ┌──────────────────┼────────────────────┐
                    │                  │                     │
                    ▼                  ▼                     ▼
         ┌──────────────────┐  ┌─────────────┐  ┌────────────────────┐
         │ Amazon OpenSearch │  │  Qdrant on  │  │ Bedrock Knowledge  │
         │ Service          │  │  EKS / ECS  │  │ Bases              │
         │ (BM25 + k-NN     │  │  (dense +   │  │ (managed, native   │
         │  hybrid, IAM)    │  │   sparse)   │  │  Confluence sync)  │
         └────────┬─────────┘  └──────┬──────┘  └─────────┬──────────┘
                  └──────────────────┬┘                    │
                                     │                     │
                                     ▼                     │
                        ┌────────────────────┐             │
                        │  ECS Fargate        │             │
                        │  MCP Server         │◄────────────┘
                        │  (FastMCP)          │
                        │  + Secrets Manager  │  ┌──────────────────┐
                        │    for API keys     │◄►│ ElastiCache Redis │
                        └────────┬────────────┘  │ query cache       │
                                 │               │ TTL 15 min        │
                                 │               └──────────────────┘
                                 ▼
                    ┌────────────────────────┐
                    │  ALB (HTTPS, ACM cert) │
                    │  + AWS WAF             │
                    └────────────┬───────────┘
                                 │  HTTPS MCP (SSE)
                                 ▼
              Claude Desktop / Cursor / Claude Code
              Streamlit chatbot (ECS Fargate or EC2)

AWS decision guide

| Your situation | Recommended path |
|---|---|
| Starting from scratch on AWS, want best retrieval quality | Qdrant Cloud on AWS (fastest) → migrate to EKS when you need data residency |
| Data must stay in your AWS account, run Kubernetes | Qdrant on EKS |
| Data must stay in your AWS account, no Kubernetes | Amazon OpenSearch Service |
| AWS platform team, want IAM + CloudWatch native | Amazon OpenSearch Service |
| Engineering time is the bottleneck, want it working this week | Amazon Bedrock Knowledge Bases |
| Already run pgvector / RDS Aurora, corpus < 1 M chunks | pgvector (no new service needed) |

Production architecture diagram (generic)

Confluence Cloud
      │
      │  Webhooks (page_created / page_updated / page_deleted)
      │  or scheduled polling every 15 min via REST API (lastModified filter)
      ▼
┌─────────────────────┐
│   Message Queue      │   Redis Streams / AWS SQS / RabbitMQ
│   change events      │   — decouples ingestion speed from Confluence rate limits
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Ingestion Worker   │   One or more worker processes (Celery / ARQ / plain threads)
│                      │   1. Fetch changed page from Confluence REST API
│   fetch → chunk      │   2. Semantic chunk with Docling
│   embed → upsert     │   3. Embed with text-embedding-3-small (batch)
│                      │   4. Upsert dense + sparse vectors into Qdrant
└──────────┬──────────┘   5. Delete Qdrant points for removed pages
           │
           ▼
┌─────────────────────────────────────────────────────┐
│                     Qdrant Collection                │
│                                                      │
│  point_id: chunk_id                                  │
│  dense vector:  text-embedding-3-small (1 536 dims)  │
│  sparse vector: BM42                                 │
│  payload:  { page_id, title, space_key, url,         │
│              last_modified, labels[], ancestor_ids[] }│
│                                                      │
│  Payload indexes on: space_key, labels, last_modified│
└──────────┬──────────────────────────────────────────┘
           │
           │  Hybrid query (dense + sparse + RRF) with payload filter
           │  → top-50 candidates
           │  → Cohere rerank → top-10
           ▼
┌─────────────────────┐        ┌───────────────────────┐
│   MCP Server         │        │   Redis Cache          │
│   (FastMCP)          │◄──────►│   query → results      │
│                      │        │   TTL: 15 min          │
│   + API key auth     │        └───────────────────────┘
│   + rate limiting    │
│   + request logging  │
└──────────┬──────────┘
           │  MCP protocol (SSE or Streamable HTTP)
           ▼
  Claude Desktop / Cursor / Claude Code / Streamlit chatbot

Access control

This is the most important production concern that the prototype ignores entirely. Confluence has page-level and space-level permissions. Without access control, any user of the chatbot can retrieve content from restricted pages they would not normally be allowed to read.

There are three approaches, ordered by accuracy vs. implementation cost:

Option A: Space-level filtering (simplest, coarse-grained)

Index only pages from spaces the chatbot is authorised to see. At query time, filter Qdrant results to the spaces the requesting user's Confluence group can access. Works well when spaces map cleanly to team boundaries. Leaks nothing as long as restricted content is in its own space.

Option B: Permission index (recommended for most companies)

When ingesting each page, call the Confluence REST API (GET /wiki/rest/api/content/{id}/restriction) to fetch which users and groups can read it. Store those groups in the Qdrant payload alongside each chunk. At query time, resolve the requesting user's group membership (via Confluence or your IdP) and add a Qdrant payload filter groups IN [user_groups]. The filter runs inside Qdrant before scoring, so no restricted results are ever returned. Rebuild this permission payload whenever Confluence restriction changes arrive via webhook.

Option C: Per-request Confluence API check (most accurate, slowest)

After Qdrant returns candidates, call the Confluence REST API to verify the requesting user can read each candidate page, and filter out the ones they cannot. Accurate because it uses Confluence's own permission system as the source of truth, but adds a round-trip per result set and becomes a bottleneck at high query volume.

For most enterprise deployments, Option B is the right balance.
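
The group check in Option B amounts to a set intersection. The sketch below runs it in memory as a stand-in for the vector store's payload filter; filter_by_groups and the sample chunks are hypothetical. Treating a chunk with no recorded groups as hidden is a fail-closed assumption of this sketch, so unrestricted pages would need their space-level groups stored at ingestion time.

```python
def filter_by_groups(chunks, user_groups):
    """Keep chunks whose allowed groups intersect the user's groups.

    Mirrors a `groups IN [user_groups]` payload filter. A chunk with no
    `groups` entry is treated as restricted (fail closed).
    """
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c.get("groups", []))]


chunks = [
    {"chunk_id": "a", "groups": ["eng"]},
    {"chunk_id": "b", "groups": ["hr"]},
    {"chunk_id": "c"},  # no permission payload recorded
]
visible = filter_by_groups(chunks, ["eng", "all-staff"])
```

In production the same condition would run inside the vector store so that restricted chunks never reach the reranker or the model.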


Sync strategy

| Scenario | Approach |
|---|---|
| Initial load | Bulk-fetch all spaces in parallel workers; embed in batches of 100; upsert to Qdrant |
| Ongoing updates | Confluence webhooks → queue → worker upserts only changed chunks |
| No webhook access | Scheduled polling every 15 min using lastModified query parameter; compare against stored last_modified in Qdrant payload; skip unchanged pages |
| Page deleted | Webhook page_deleted event → delete all Qdrant points where page_id == deleted_id |
| Space deleted | Delete all Qdrant points where space_key == deleted_key |
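
The sync rules above reduce to a small event dispatcher. The sketch below applies them to an in-memory dict standing in for the vector store; apply_event, the event dict shape, and the space_deleted event name are assumptions made for illustration.

```python
def _delete_where(index, field, value):
    """Remove every chunk whose payload has payload[field] == value."""
    for chunk_id in [cid for cid, p in index.items() if p.get(field) == value]:
        del index[chunk_id]


def apply_event(index, event):
    """Apply one change event to an index mapping chunk_id -> payload."""
    kind = event["type"]
    if kind in ("page_created", "page_updated"):
        # Drop the page's old chunks first so a shortened page leaves no orphans.
        _delete_where(index, "page_id", event["page_id"])
        for chunk in event["chunks"]:
            index[chunk["chunk_id"]] = chunk
    elif kind == "page_deleted":
        _delete_where(index, "page_id", event["page_id"])
    elif kind == "space_deleted":
        _delete_where(index, "space_key", event["space_key"])
    else:
        raise ValueError(f"unknown event type: {kind}")


def _chunk(n):
    return {"chunk_id": f"p1_c{n}", "page_id": "p1", "space_key": "ENG"}


index = {}
apply_event(index, {"type": "page_updated", "page_id": "p1",
                    "chunks": [_chunk(0), _chunk(1)]})
after_first = sorted(index)
# The page shrinks to a single chunk; the stale second chunk must disappear.
apply_event(index, {"type": "page_updated", "page_id": "p1",
                    "chunks": [_chunk(0)]})
after_second = sorted(index)
apply_event(index, {"type": "page_deleted", "page_id": "p1"})
```

Deleting a page's existing chunks before upserting is a design choice of this sketch: it keeps the index correct when an edit reduces the number of chunks on a page.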

AWS cost estimates

Prices below are approximate, based on us-east-1 on-demand rates as of mid-2026. Verify current prices with the AWS Pricing Calculator before budgeting. All figures are per month.

Tier assumptions

| Metric | Small | Medium | Large |
|---|---|---|---|
| Confluence pages indexed | 10 000 | 100 000 | 500 000 |
| Chunks in index | ~50 000 | ~500 000 | ~2 500 000 |
| Active users | 50 | 500 | 2 000 |
| Queries per day | 100 | 1 000 | 5 000 |
| Queries per month | 3 000 | 30 000 | 150 000 |

Monthly cost breakdown — OpenSearch Service path

| Component | Small | Medium | Large |
|---|---|---|---|
| Amazon OpenSearch Service instances | $52 (t3.small × 2 nodes HA) | $320 (m6g.large × 3 nodes HA) | $1 280 (m6g.2xlarge × 3 nodes HA) |
| Amazon OpenSearch Service EBS storage (gp3) | $3 | $27 | $135 |
| ECS Fargate MCP server (0.5 vCPU / 1 GB each) | $18 (1 instance) | $36 (2 instances) | $72 (4 instances) |
| ECS Fargate ingestion workers | $5 | $15 | $40 |
| ElastiCache Redis | $12 (cache.t3.micro) | $49 (cache.t3.small × 2, multi-AZ) | $240 (cache.r6g.large × 2, multi-AZ) |
| ALB + ACM certificate | $20 | $25 | $35 |
| AWS WAF | - | $10 | $20 |
| SQS + Secrets Manager + CloudWatch | $8 | $12 | $20 |
| OpenAI text-embedding-3-small | <$1 | $1 | $5 |
| Cohere rerank-v4.0-fast | $6 | $60 | $300 |
| Anthropic Claude claude-sonnet-4-6 | $78 | $780 | $3 900 |
| Estimated total | ~$203 | ~$1 335 | ~$6 047 |

Anthropic cost breakdown per query

Each user question triggers an agentic loop with 2–3 Claude API calls (list_spaces → search → optionally get_page). A realistic average:

| Token type | Tokens per query | Cost (Sonnet 4.6) |
|---|---|---|
| Input (system prompt + tool results) | ~5 500 | $0.0165 |
| Output (tool selections + final answer) | ~600 | $0.0090 |
| Total per query | | ~$0.026 |

At 1 000 queries/day × 30 days, that comes to ~$780/month in Claude API costs alone. This is the largest single line item at every scale; infrastructure is secondary.
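
The per-query figure can be reproduced with a few lines of arithmetic. The prices used here ($3/MTok input, $15/MTok output) are the rates implied by the token table above; cost_per_query is a hypothetical helper name.

```python
def cost_per_query(input_tokens, output_tokens,
                   input_price_per_mtok=3.0, output_price_per_mtok=15.0):
    """USD cost of one agentic query, given token counts and per-MTok prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


per_query = cost_per_query(5_500, 600)   # unrounded per-query cost
monthly = per_query * 1_000 * 30         # 1 000 queries/day for 30 days
```

The unrounded per-query cost is slightly below the rounded $0.026 used in the table, so the monthly figure computed this way lands a little under the quoted ~$780.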

Key observations

  1. The LLM API bill dominates. At medium scale Claude accounts for ~58% of total spend; at large scale ~64%. Optimise here first before touching infrastructure.

  2. Infrastructure costs are reasonable. Even at large scale (3 OpenSearch nodes, 4 MCP server tasks, a Redis cluster) the AWS bill excluding API costs is ~$1 800/month, well below the monthly cost of a single engineer in most markets. The ROI calculation is almost always positive.

  3. OpenSearch storage is cheap. 2.5 M chunks × 300 tokens × ~4 bytes ≈ 3 GB of raw text. With OpenSearch overhead (inverted index + k-NN graph) plan for ~10× = ~30 GB = ~$4/month. Storage is never the problem.
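
The shares and the storage estimate in these observations can be recomputed from the cost breakdown table. The dictionaries below copy the table's monthly figures; counting the small-tier "<$1" embedding cost as $1 is an assumption of this sketch.

```python
# Monthly figures copied from the cost breakdown table (USD).
totals = {"small": 203, "medium": 1335, "large": 6047}
claude = {"small": 78, "medium": 780, "large": 3900}
# Embeddings + reranker, the other two external API line items.
external_apis = {"small": 1 + 6, "medium": 1 + 60, "large": 5 + 300}

# Claude's share of total spend, in whole percent.
claude_share = {t: round(100 * claude[t] / totals[t]) for t in totals}

# AWS-only bill: everything except the three external APIs.
aws_only = {t: totals[t] - claude[t] - external_apis[t] for t in totals}

# Raw text size: 2.5 M chunks × 300 tokens × ~4 bytes per token.
raw_text_gb = 2_500_000 * 300 * 4 / 1e9
```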

Cost optimisation levers

These are ordered by impact:

1. Anthropic prompt caching — saves 20–30% on Claude costs Enable cache_control on the system prompt and the tool definitions block. Cached input tokens are billed at $0.30/MTok (90% discount vs $3/MTok). The system prompt (~500 tokens) and tool list (~300 tokens) are identical on every call and qualify for caching immediately.

2. Route simple queries to Claude Haiku — saves up to 60% on Claude costs Haiku 4.5 costs $0.80/MTok in and $4/MTok out — roughly 4× cheaper than Sonnet. Add a classifier that sends single-fact lookups ("What is a ConfigMap?") to Haiku and only escalates multi-hop questions ("What changed in our deployment process and why?") to Sonnet. If 60% of queries qualify, medium-scale Claude spend drops from ~$780 to ~$390/month.

3. Redis query caching — saves 20–40% on Claude costs for repeated questions Teams ask the same questions. "How do I request VPN access?" is asked by every new joiner. A Redis cache keyed on a normalised query hash with a 15-minute TTL (already in the architecture diagram) eliminates the Claude round-trip entirely for cache hits. Common internal Q&A workloads see 25–35% cache hit rates.

4. Reserved instances for OpenSearch: saves 30–40% on compute

A 1-year Reserved Instance for m6g.large.search drops from $0.148/hr to ~$0.088/hr. On three nodes that saves ~$130/month (from ~$320 to ~$190 per 3-node cluster). Commit only after validating instance size in production.

5. Fargate Spot for ingestion workers: saves ~70% on worker compute

Ingestion workers are interruptible: if a Spot interruption occurs, SQS re-delivers the message and the worker retries. Set the capacity provider strategy of the ingestion ECS service to FARGATE_SPOT. At medium scale this saves ~$10/month; at large scale ~$28/month.

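6. Cohere free tier for small teams

The Cohere trial tier gives 1 000 free rerank calls/month. A team of 50 users making 3 searches/day averages ~4 500 calls/month, about 4.5× the free limit, so the trial tier covers only about 11 users at that rate. Reducing top_k from 50 to 25 candidates halves the documents sent per rerank call at a small quality cost, but it does not reduce the number of calls; to cut call volume, serve repeated queries from the Redis cache before they reach the reranker.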

Realistic optimised costs

Applying caching, Haiku routing (60% of queries), and reserved instances:

| | Small | Medium | Large |
| --- | --- | --- | --- |
| Baseline estimate | ~$203 | ~$1 335 | ~$6 047 |
| After optimisations | ~$110 | ~$620 | ~$2 900 |
| Saving | ~46% | ~54% | ~52% |

Amazon Bedrock Knowledge Bases — cost comparison

The fully managed path trades control for simplicity but is not always cheaper:

| Component | Monthly cost |
| --- | --- |
| Bedrock Titan Embeddings ($0.0001/1K tokens) | ~$0.15 (initial), <$1 ongoing |
| OpenSearch Serverless (minimum 4 OCUs) | ~$700 |
| Bedrock Retrieve API calls | ~$0.10 per 1 000 calls |
| Anthropic Claude (same as above) | same |

The OpenSearch Serverless minimum of 4 OCUs (~$700/month) makes Bedrock Knowledge Bases more expensive than self-managed OpenSearch at small and medium scale. It becomes cost-competitive only above ~2 M chunks where you need multiple OpenSearch data nodes anyway. The main argument for Bedrock Knowledge Bases is not cost — it is engineering time saved on the ingestion pipeline.

Disclaimer: All figures are estimates based on public AWS and API pricing as of mid-2026. Actual costs depend on your specific usage patterns, AWS region, negotiated enterprise pricing, and data transfer costs. Use the AWS Pricing Calculator for precise projections before committing to an architecture.


Beyond vector databases — better retrieval approaches

Swapping one vector database for another improves scale and operational robustness but does almost nothing for retrieval quality. The two approaches below address the actual quality bottlenecks for a Confluence-sized corpus.


Approach 1 — Contextual Retrieval

What it is: Before indexing each chunk, ask Claude to prepend a short context paragraph describing where the chunk sits in the document, what the page is about, and why this section matters. Then index with BM25 only — no dense embeddings at all — and apply the existing Cohere reranker on top.

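Anthropic published benchmarks for this technique in September 2024: adding context to chunks cut top-20 retrieval failures by 35% with contextual embeddings alone, by 49% with contextual embeddings plus contextual BM25, and by 67% once a reranker was added, all relative to a plain embedding baseline. Those published figures include dense embeddings; the BM25-only variant described here relies on the same context prefix plus the reranker, so measure it against your current hybrid pipeline before removing the vector index.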

Why Confluence chunks need this: When you strip HTML from a Confluence page and split it into 1 500-character chunks, the chunks lose their surrounding context. A chunk that says "set the flag to true to enable this feature" scores well for the query "how do I enable features" but is useless without knowing which page it came from and which feature it refers to. The prepended context sentence — "This chunk is from the Engineering Handbook, section on Feature Flags, describing how to enable a new flag in the production config service" — makes the chunk self-contained and dramatically more retrievable.
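The context-prefix step can be sketched as a prompt builder plus a function that prepends the generated context to the chunk. The prompt wording and the function names are assumptions, and the model call is injected as a plain callable (`generate_context`) so the example runs without an API key.

```python
from typing import Callable

# Hypothetical prompt wording; tune it against your own pages.
CONTEXT_PROMPT = (
    "<document>\n{page_text}\n</document>\n"
    'Here is a chunk from the page titled "{page_title}":\n'
    "<chunk>\n{chunk_text}\n</chunk>\n"
    "Write one or two sentences situating this chunk within the page "
    "to improve search retrieval. Answer with the context only."
)


def build_context_prompt(page_title: str, page_text: str, chunk_text: str) -> str:
    return CONTEXT_PROMPT.format(
        page_title=page_title, page_text=page_text, chunk_text=chunk_text
    )


def contextualise_chunk(
    page_title: str,
    page_text: str,
    chunk_text: str,
    generate_context: Callable[[str], str],
) -> str:
    """Return the chunk with a generated context paragraph prepended."""
    prompt = build_context_prompt(page_title, page_text, chunk_text)
    context = generate_context(prompt).strip()
    # The combined text is what gets tokenised and indexed by BM25.
    return f"{context}\n\n{chunk_text}"


def fake_llm(prompt: str) -> str:
    """Stand-in for a Claude call, returning a fixed context sentence."""
    return (
        "This chunk is from the Engineering Handbook, section on Feature Flags, "
        "describing how to enable a new flag in the production config service."
    )


indexed_text = contextualise_chunk(
    "Feature Flags",
    "(full page text)",
    "set the flag to true to enable this feature",
    fake_llm,
)
print(indexed_text)
```

Because the original chunk text is kept verbatim after the prefix, the reranker and the agent still see the unmodified source wording.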

What changes in the pipeline:

| Step | Without contextual retrieval | With contextual retrieval |
| --- | --- | --- |
| Indexing | chunk → BM25 + embed → store | chunk → Claude context → contextual chunk → BM25 → store |
| Dense embeddings | Required | Removed entirely |
| Vector database | Required | Not needed |
| BM25 index | Required | Required (same) |
| Reranker | Required | Required (same) |
| MCP server | Unchanged | Unchanged |
| Agent | Unchanged | Unchanged |

One-time indexing cost (Claude Haiku at $0.80/MTok):

| Corpus size | Chunks | Context tokens | Indexing cost |
| --- | --- | --- | --- |
| 10 000 pages | 50 000 | ~10 M tokens | ~$8 |
| 100 000 pages | 500 000 | ~100 M tokens | ~$80 |
| 500 000 pages | 2 500 000 | ~500 M tokens | ~$400 |

This is a one-time cost per full re-index, not a recurring monthly expense. Delta syncs (only changed pages) are proportionally cheaper.
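The indexing-cost table follows from three multipliers, which the sketch below makes explicit. The 5 chunks per page and 200 tokens per chunk are back-derived from the table rows rather than stated in the source, so treat them as assumptions; `changed_fraction` models a delta sync.

```python
def indexing_cost_usd(
    pages: int,
    chunks_per_page: int = 5,      # assumption: 50 000 chunks / 10 000 pages
    tokens_per_chunk: int = 200,   # assumption: 10 M tokens / 50 000 chunks
    price_per_mtok: float = 0.80,  # Haiku price quoted above
    changed_fraction: float = 1.0, # 1.0 = full re-index, 0.05 = 5% delta sync
) -> float:
    """Estimate the context-generation cost for a (partial) re-index."""
    chunks = pages * chunks_per_page * changed_fraction
    million_tokens = chunks * tokens_per_chunk / 1e6
    return million_tokens * price_per_mtok


for pages in (10_000, 100_000, 500_000):
    print(f"{pages:>7} pages: ${indexing_cost_usd(pages):,.0f}")

# A delta sync touching 5% of a 100 000-page corpus:
print(f"delta sync: ${indexing_cost_usd(100_000, changed_fraction=0.05):,.2f}")
```

Replacing the defaults with measured chunk counts and token usage from a sample of real pages gives a tighter estimate than the table.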

Running cost impact: Removing dense embeddings eliminates the OpenAI embedding API call on every query (~$0.02/1M tokens, small but real), removes the vector index from OpenSearch or Qdrant (reducing storage by 30–50%), and simplifies the retrieval code to a single BM25 query path.

AWS implementation: Keep OpenSearch Service for BM25. Remove the k-NN plugin configuration and dense vector field entirely. Add a pre-indexing ECS task that calls the Anthropic API to generate context for each chunk before the ingestion worker writes to OpenSearch.
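For illustration, a BM25-only index body might look like the following. The field names are hypothetical; only the field types (`keyword`, `text`, `date`) are OpenSearch-defined, and `text` fields are scored with BM25 by default, so no similarity setting is needed. The point of the sketch is what is absent: there is no `knn_vector` field.

```python
from typing import Any

# Hypothetical field names for a contextual-BM25 chunk index.
BM25_INDEX_BODY: dict[str, Any] = {
    "mappings": {
        "properties": {
            "page_id": {"type": "keyword"},
            "space_key": {"type": "keyword"},
            "title": {"type": "text"},
            # Generated context + original chunk, indexed as one text field.
            "contextual_text": {"type": "text"},
            "last_modified": {"type": "date"},
        }
    }
}


def field_types(index_body: dict[str, Any]) -> dict[str, str]:
    """Map each field name to its OpenSearch type."""
    properties = index_body["mappings"]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


print(field_types(BM25_INDEX_BODY))
```

With opensearch-py the body would be passed to `client.indices.create(index=..., body=BM25_INDEX_BODY)`; the index name is yours to choose.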


Approach 2 — GraphRAG

What it is: Instead of treating Confluence as a flat collection of text chunks, build a knowledge graph from it — extracting entities, relationships, and summaries — and answer questions by traversing the graph rather than scoring chunks by similarity.

Microsoft Research published GraphRAG in 2024 and showed 20–70% improvement over naive RAG on complex multi-hop questions depending on question type. The gains are largest on exactly the investigative questions that Confluence is used for.

Why Confluence is a natural graph: Confluence already has rich structure that flat retrieval throws away:

Space
  └── Section page
        ├── Child page A  ──links to──► ADR-042
        │     └── Child page A1         │
        └── Child page B                └──triggered by──► Incident-2024-Q3
              (author: team-platform)                        (owner: team-sre)

When a user asks "What changed in our deployment process last quarter and why?", the answer requires:

  1. Find the current deployment process page
  2. Follow links to the ADR that modified it
  3. Find the incident report that triggered the ADR
  4. Synthesise the chain of causality across three pages

A flat vector search returns the chunks with the highest similarity score. A graph traversal follows the actual structure of the knowledge.
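The four steps above amount to following typed edges from one page to the next. The toy example below uses the page identifiers from the diagram; the `deployment-process` node name, the relation labels, and the dictionary layout are assumptions for illustration, not the Neptune schema.

```python
# Hypothetical toy graph: (source node, relation) -> target nodes.
EDGES: dict[tuple[str, str], list[str]] = {
    ("deployment-process", "links_to"): ["ADR-042"],
    ("ADR-042", "triggered_by"): ["Incident-2024-Q3"],
}


def follow_chain(
    start: str,
    relations: list[str],
    edges: dict[tuple[str, str], list[str]],
) -> list[list[str]]:
    """Follow one relation type per step and return every complete path."""
    paths = [[start]]
    for relation in relations:
        next_paths = []
        for path in paths:
            for target in edges.get((path[-1], relation), []):
                next_paths.append(path + [target])
        paths = next_paths  # paths with no matching edge are dropped
    return paths


chains = follow_chain("deployment-process", ["links_to", "triggered_by"], EDGES)
print(chains)
```

Each returned path lists the pages whose text the agent would fetch and synthesise, in causal order.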

How GraphRAG works for Confluence:

Ingestion
  │
  ├─► Extract entities from each page
  │     (system names, team names, process names, ticket IDs, dates)
  │
  ├─► Extract relationships between entities
  │     (page A links to page B, process X was changed by ADR Y,
  │      incident Z triggered decision W)
  │
  ├─► Build community summaries
  │     (cluster related pages into topics, summarise each cluster)
  │
  └─► Store in graph database (Neptune) + keep BM25 for keyword search

Query
  │
  ├─► Global questions ("what are our main deployment processes?")
  │     → community summary traversal, no chunk retrieval needed
  │
  └─► Local questions ("how do I deploy service X?")
        → entity lookup → graph hop → retrieve relevant pages → answer

Two query modes:

| Mode | Best for | How it works |
| --- | --- | --- |
| Local search | Specific factual questions | Find entity in graph → traverse 1–2 hops → retrieve source pages → answer |
| Global search | Broad thematic questions | Query community summaries → synthesise across the whole corpus |
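The two modes differ mainly in what they read: local search collects the neighbourhood of one entity, while global search reads pre-built community summaries. A minimal in-memory sketch follows; the graph contents, summary texts, and the keyword match used for global search are all assumptions (a real implementation would typically have an LLM synthesise over the selected summaries).

```python
from collections import deque

# Hypothetical adjacency list and community summaries.
GRAPH = {
    "service-x": ["deploy-runbook", "team-platform"],
    "deploy-runbook": ["ADR-042"],
    "ADR-042": ["Incident-2024-Q3"],
}
COMMUNITY_SUMMARIES = {
    "deployment": "Deployment processes: runbooks, ADRs, and rollout policy.",
    "access": "Access management: VPN requests and onboarding steps.",
}


def local_search(graph: dict[str, list[str]], entity: str, max_hops: int = 2) -> list[str]:
    """Breadth-first search returning nodes within max_hops of the entity."""
    seen = [entity]
    queue = deque([(entity, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def global_search(summaries: dict[str, str], query: str) -> list[str]:
    """Select community summaries that share a word with the query."""
    words = set(query.lower().replace("?", " ").split())
    return [
        text for text in summaries.values()
        if words & set(text.lower().replace(":", " ").replace(",", " ").split())
    ]


print(local_search(GRAPH, "service-x"))
print(global_search(COMMUNITY_SUMMARIES, "What are our main deployment processes?"))
```

The hop limit keeps local search cheap: `Incident-2024-Q3` sits three hops from `service-x` and is therefore not returned with the default of two.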

AWS implementation: Amazon Neptune (fully managed graph database) for the knowledge graph. Neptune Analytics (in-memory graph engine, announced 2023) for fast traversal queries. The ingestion worker gains a graph extraction step that calls Claude to identify entities and relationships per page, then writes edges to Neptune alongside the existing OpenSearch BM25 upsert.

AWS architecture addition for GraphRAG:

Ingestion worker
  │
  ├─► (existing) chunk → BM25 upsert → OpenSearch
  │
  └─► (new) page full text → Claude entity extraction
                              → Neptune upsert (nodes + edges)
                              → Neptune Analytics community clustering (nightly)

Query (MCP server)
  │
  ├─► hybrid_search (existing BM25 + rerank path)
  │
  └─► graph_search (new tool)
        → Neptune Analytics traversal
        → fetch source pages
        → synthesise answer

The existing hybrid_search, get_page_full, and list_spaces MCP tools remain unchanged. graph_search is an additional fourth tool the agent can call for questions that require following relationships across pages.

Cost addition (medium scale, 100k pages):

| Component | Monthly cost |
| --- | --- |
| Amazon Neptune (db.r6g.large) | ~$200 |
| Neptune Analytics (2 NCUs) | ~$180 |
| Claude Haiku entity extraction (initial, one-time) | ~$50 |
| Claude Haiku entity extraction (monthly delta, 5% change) | ~$3 |
| Total addition (recurring, excluding the one-time extraction) | ~$383/month |

Comparison across all approaches

| Approach | Retrieval quality | Ops complexity | Monthly cost delta | Best for |
| --- | --- | --- | --- | --- |
| Prototype (BM25+dense+RRF+rerank) | Good | Low | baseline | Getting started |
| Vector DB upgrade (Qdrant/OpenSearch kNN) | Good | Medium | +$0–200 | Scale, not quality |
| Contextual Retrieval + BM25 | Very good | Low | −$50–200 (saves embedding cost) | Best quality-to-effort ratio |
| GraphRAG | Excellent on multi-hop | High | +$350–600 | Complex investigative questions |
| Contextual + GraphRAG | Excellent across all types | Very high | +$250–400 net | Full enterprise production |

Recommended implementation order

Phase 1: Contextual Retrieval (week 1–2)

Add a context-generation step before BM25 indexing. Remove dense embeddings and the vector index. This is the highest-ROI change: better retrieval quality, lower running cost, less infrastructure. The MCP server, agent, and chatbot are untouched.

Phase 2: Vector DB for scale (month 2–3, if corpus > 500k chunks)

If the corpus grows large enough that OpenSearch BM25 performance degrades, add a vector index (OpenSearch kNN or Qdrant) back in alongside the contextual BM25. At this scale the quality gain from contextual retrieval still applies on top.

Phase 3: GraphRAG (month 4–6, if multi-hop questions dominate)

Instrument user queries for 4–6 weeks after Phase 1. If a significant share of questions require tracing relationships across pages (history of a decision, ownership chains, impact of an incident), add the Neptune graph layer and the graph_search MCP tool. This is a non-trivial engineering investment, so only do it when the query analysis confirms it is the right bottleneck to fix.

Key principle: The vector database is a storage and scale concern. Contextual Retrieval and GraphRAG are quality concerns. Fix quality first, then scale the storage layer to match the corpus size.
