Volterra Knowledge Engine
A read-only Model Context Protocol server that exposes a semantic knowledge base to AI agents via 27 tools. It enables querying of documents and data integrated from sources like Notion, SharePoint, HubSpot, and Slack.
README
Volterra Knowledge Engine
Enterprise document ingestion pipeline with AI embeddings, multi-source connectors, and GDPR-compliant semantic search.
Architecture
graph TB
CLI[CLI Commands] -->|Ingest| DP[Document Processor]
DP -->|Parse| Parsers[Format Parsers]
DP -->|Embed| OAI[OpenAI API]
DP -->|Store| DB[(PostgreSQL + pgvector)]
DP -->|Compliance| PII[PII Detector]
subgraph Sources
FS[Local Files]
NO[Notion API]
SP[SharePoint]
HS[HubSpot]
SL[Slack Export]
end
Sources -->|Fetch| DP
subgraph Edge Functions
HTS[HubSpot Ticket Sync]
NPS[Notion Pages Sync]
SCS[Slack Channel Sync]
MCP[MCP Server]
end
Cron[pg_cron] -->|Scheduled| Edge Functions
Edge Functions -->|Read/Write| DB
Key Features
- Multi-source ingestion — Local files, Notion, SharePoint, HubSpot, Slack with unified processing pipeline
- Format support — PDF, DOCX, XLSX, CSV, HTML, email, plain text with extensible parser architecture
- pgvector embeddings — OpenAI text-embedding-3-small (1536d) with HNSW indexes for semantic search
- GDPR compliance — Automatic PII detection, sensitivity classification, and access level enforcement
- Automated sync — pg_cron + Edge Functions for daily data ingestion from Notion, HubSpot, Slack
- MCP server — Read-only Model Context Protocol server exposing 27 tools for AI agent access
- n8n integration — Workflow management CLI for automating ingestion pipelines
Tech Stack
| Layer | Technology |
|---|---|
| Runtime | Node.js 18+ with TypeScript (ESM) |
| Database | PostgreSQL + pgvector (Supabase) |
| Embeddings | OpenAI text-embedding-3-small (1536d) |
| Parsers | pdfjs-dist, mammoth, xlsx, mailparser |
| Sources | Notion API, Microsoft Graph, HubSpot API, Slack API |
| Compliance | Custom PII detector with redact-pii, franc (language) |
| Scheduling | pg_cron + Supabase Edge Functions |
| CLI | Commander.js with structured logging (Winston) |
Project Structure
src/
├── core/
│ ├── document-processor.ts # Main orchestration (451 lines)
│ ├── embedding-service.ts # OpenAI embedding generation
│ └── metadata-inference.ts # Auto-classification
├── parsers/ # Format-specific text extraction
│ ├── pdf-parser.ts
│ ├── docx-parser.ts
│ ├── xlsx-parser.ts
│ ├── wod-parser.ts # Structured deal data extraction
│ └── ...
├── sources/ # Data source connectors
│ ├── notion-source.ts
│ ├── sharepoint-source.ts
│ ├── hubspot-source.ts
│ └── slack-source.ts
├── compliance/
│ ├── pii-detector.ts # PII pattern detection
│ └── gdpr-handler.ts # Sensitivity classification
├── services/
│ ├── n8n-api-client.ts # n8n REST API client
│ └── vision-service.ts # GPT-4o image analysis
└── scripts/ # CLI entry points
supabase/
├── functions/ # Edge Functions (sync, MCP)
└── migrations/ # PostgreSQL schema migrations
Getting Started
-
Install dependencies:
npm install -
Configure environment:
cp .env.example .env -
Set up database (run migrations in Supabase SQL Editor):
CREATE EXTENSION IF NOT EXISTS vector; -- Then apply migration files in chronological order -
Ingest documents:
# Local files npm run ingest:file ./documents/ # From Notion npm run ingest:notion # From HubSpot npm run ingest:hubspot # Slack export npm run ingest:slack -- --export-path /path/to/export
GDPR Compliance
The system automatically detects PII (emails, phone numbers, SSNs, names) and classifies document sensitivity:
| Mode | Behavior |
|---|---|
| Flag | Detects and flags PII, stores original content |
| Redact | Replaces PII with placeholders before storing |
Documents with detected PII are automatically upgraded to restricted or confidential access levels.
Key Design Decisions
- Extensible parser architecture — Base class pattern makes adding new format parsers trivial
- Source-agnostic processing — All sources normalize to the same document interface before embedding
- HNSW over IVFFlat — Better recall accuracy for semantic search at slightly higher index build cost
- pg_cron for sync — Database-native scheduling avoids external cron services
- MCP server — Exposes knowledge base to AI agents via standardized protocol
Built By
Adrian Marten — GitHub
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.