caseware-ai-procurement-knowledge-platform

Enables AI assistants to retrieve, search, and compare procurement documents using hybrid retrieval and MCP integration.

Caseware AI Procurement Knowledge Platform

AI-ready procurement knowledge platform built with a local data pipeline, hybrid retrieval, and the Model Context Protocol (MCP).

Overview

This project implements an end-to-end AI-ready data platform for procurement and inventory documents.

The solution demonstrates how structured and unstructured business documents can be ingested, transformed into searchable knowledge, and exposed through a Model Context Protocol (MCP) server. AI assistants can then retrieve evidence, reason across related documents, and generate grounded responses with source references.

The implementation intentionally remains lightweight and fully local while showcasing modern AI Data Engineering concepts, including:

PDF and image ingestion
OCR fallback for scanned documents
Structured metadata extraction
Semantic embeddings
Hybrid retrieval
Cross-document relationship matching
MCP tool integration
Grounded AI responses

The architecture prioritizes simplicity, explainability, and reproducibility, following the challenge recommendation to avoid over-engineering.

Key Capabilities

PDF document ingestion
OCR using Tesseract
Native PDF parsing with PyMuPDF
Structured metadata extraction
SQLite metadata store
ChromaDB vector database
SentenceTransformers embeddings
Hybrid retrieval (metadata + semantic search)
Cross-document relationship matching
Procurement document comparison
MCP server integration
Claude Desktop integration
Grounded source citations

Design Goals

The solution was intentionally designed to demonstrate the core architectural components of an AI-ready data platform while keeping the implementation easy to understand and reproduce.

Primary goals include:

Reproducible local execution
AI-ready document preparation
Explainable retrieval
Hybrid search combining deterministic metadata and semantic similarity
Modular architecture with clear separation of concerns
Agent integration through MCP

Rather than focusing on production-scale infrastructure, the implementation emphasizes engineering decisions, maintainability, and retrieval quality.

High-Level Architecture

flowchart TD

    A[Raw Procurement Documents]

    A --> B[PDF Parser]
    A --> C[OCR - Tesseract]

    B --> D[Extracted Text]
    C --> D

    D --> E[Chunking]

    E --> F[Metadata Extraction]
    E --> G[SentenceTransformers Embeddings]

    F --> H[(SQLite)]
    G --> I[(ChromaDB)]

    H --> J[Hybrid Retrieval Layer]
    I --> J

    J --> K[FastMCP Server]

    K --> L[Claude Desktop]

Data Flow

The ingestion pipeline performs the following steps:

Load procurement documents from the local filesystem.
Parse native PDFs using PyMuPDF.
Apply OCR to scanned documents using Tesseract.
Normalize extracted text.
Extract structured procurement metadata.
Split documents into retrieval-ready chunks.
Generate semantic embeddings.
Store structured metadata in SQLite.
Store semantic vectors in ChromaDB.
Expose retrieval capabilities through an MCP server.
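
The normalization and chunking steps above could look roughly like the sketch below. The README does not state the chunk size, overlap, or splitting rule used by chunk.py, so the function names and the word-window defaults here are assumptions for illustration only.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into word windows; consecutive windows share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = normalize_text(text).split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlapping windows are one simple way to avoid cutting a sentence in half at a chunk boundary; a sentence-aware or page-aware splitter would be a reasonable alternative.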

Technology Stack

Layer	Technology
Language	Python
PDF Parsing	PyMuPDF
OCR	Tesseract
Embeddings	SentenceTransformers
Metadata Store	SQLite
Vector Database	ChromaDB
MCP Framework	FastMCP
AI Client	Claude Desktop

Project Structure

The project is organized into independent modules following a clear separation of concerns. Each component has a single responsibility, making the solution easier to understand, maintain, and extend.

caseware-ai-data-mcp/
│
├── app/
│   ├── pipeline/
│   │   ├── extract.py          # PDF parsing and OCR
│   │   ├── chunk.py            # Document chunking
│   │   ├── model.py            # Metadata extraction
│   │   ├── ingest.py           # End-to-end ingestion pipeline
│   │   └── index.py            # ChromaDB indexing
│   │
│   ├── retrieval/
│   │   ├── search.py           # Semantic retrieval
│   │   ├── matching.py         # Cross-document matching
│   │   ├── hybrid.py           # Hybrid retrieval
│   │   └── citations.py        # Source references
│   │
│   ├── db.py                   # SQLite initialization
│   └── server.py               # FastMCP server
│
├── data/
│   └── raw/                    # Procurement documents
│
├── storage/
│   ├── knowledge.db            # SQLite metadata store
│   └── chroma/                 # ChromaDB vector index
│
├── run_pipeline.py
├── requirements.txt
└── README.md

Module Responsibilities

Pipeline

The pipeline transforms raw procurement documents into AI-ready knowledge.

Responsibilities include:

Reading procurement documents
Parsing PDF files
Running OCR when required
Extracting structured metadata
Chunking document content
Generating semantic embeddings
Populating SQLite
Building the ChromaDB vector index
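
"Running OCR when required" can be sketched as a fallback decision: try native parsing first and use OCR only when it yields too little text. The function name, the injected callables, and the 50-character threshold are assumptions; in the real pipeline the two callables would wrap PyMuPDF and Tesseract.

```python
from typing import Callable


def extract_with_ocr_fallback(
    path: str,
    parse_native: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 50,  # assumed threshold, not taken from the project
) -> tuple[str, str]:
    """Return (text, method), using OCR only when native parsing finds too little text."""
    text = parse_native(path).strip()
    if len(text) >= min_chars:
        return text, "native"
    return run_ocr(path).strip(), "ocr"
```

Passing the extractors in as arguments keeps the decision logic testable without installing either library.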

Retrieval

The retrieval layer is responsible for answering user questions.

It combines two complementary strategies:

Deterministic metadata lookup
Semantic vector search

This hybrid approach improves retrieval precision while maintaining flexibility for natural language queries.

Storage

Structured and semantic information are intentionally stored separately.

SQLite

Stores:

Document metadata
Extracted procurement fields
Chunk metadata
Document relationships

ChromaDB

Stores:

Sentence embeddings
Semantic vector index

Separating these responsibilities keeps the architecture simple while allowing each technology to focus on its strengths.
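
As an illustration of the SQLite side, the sketch below creates tables for documents, chunk metadata, and document relationships. The actual schema in db.py is not shown in this README, so every table and column name here is hypothetical.

```python
import sqlite3

# Hypothetical schema: names and columns are assumptions, not the project's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id        INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL UNIQUE,
    doc_type  TEXT NOT NULL,
    order_id  TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    page        INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS relationships (
    source_id INTEGER NOT NULL REFERENCES documents(id),
    target_id INTEGER NOT NULL REFERENCES documents(id),
    relation  TEXT NOT NULL
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Embeddings are deliberately absent from this schema; in the design described above they live in ChromaDB, keyed by chunk.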

MCP Server

The FastMCP server exposes business-oriented retrieval capabilities rather than direct database access.

Available operations include:

Search procurement documents
Retrieve supporting documents for an order
Compare procurement documents
Detect missing purchase orders
Execute hybrid retrieval

This abstraction allows AI assistants to interact with procurement knowledge through natural language instead of SQL queries.
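
A business-oriented tool might be registered along the lines of the sketch below. It assumes the FastMCP class shipped with the official MCP Python SDK (the project may use the standalone fastmcp package instead), and the documents table it queries is a hypothetical schema, not the project's actual one.

```python
import sqlite3
from contextlib import closing

try:  # the MCP Python SDK is only needed to actually run the server
    from mcp.server.fastmcp import FastMCP
except ImportError:
    FastMCP = None


def find_documents_for_order(conn: sqlite3.Connection, order_id: str) -> list[dict]:
    """Look up documents by order ID (assumed table and column names)."""
    rows = conn.execute(
        "SELECT file_name, doc_type FROM documents WHERE order_id = ? ORDER BY doc_type",
        (order_id,),
    ).fetchall()
    return [{"file": file_name, "type": doc_type} for file_name, doc_type in rows]


def build_server(db_path: str = "storage/knowledge.db"):
    """Create a server that exposes one tool instead of raw SQL access."""
    if FastMCP is None:
        raise RuntimeError("install the 'mcp' package to run the server")
    server = FastMCP("procurement-knowledge")

    @server.tool()
    def get_documents_for_order(order_id: str) -> list[dict]:
        """Retrieve supporting procurement documents for an Order ID."""
        with closing(sqlite3.connect(db_path)) as conn:
            return find_documents_for_order(conn, order_id)

    return server
```

Calling build_server().run() would start the server. The client only sees the tool name, its docstring, and the order_id parameter; the SQL stays inside the server.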

Installation

Prerequisites

Before running the project, install the following software:

Dependency	Version
Python	3.11+
Git	Latest
Tesseract OCR	Latest
Claude Desktop (optional)	Latest

Clone the Repository

git clone <repository-url>

cd caseware-ai-data-mcp

Create a Virtual Environment

macOS / Linux

python -m venv env

source env/bin/activate

Windows

python -m venv env

env\Scripts\activate

Install Python Dependencies

pip install -r requirements.txt

Install OCR

macOS

brew install tesseract

Ubuntu

sudo apt install tesseract-ocr

Windows

Download and install Tesseract from:

https://github.com/UB-Mannheim/tesseract/wiki

Verify the installation:

tesseract --version

Preparing the Dataset

Place the procurement documents inside the data/raw/ directory.

data/
└── raw/
    ├── contracts/
    ├── invoices/
    ├── purchase_orders/
    ├── shipping_orders/
    └── inventory_reports/

Supported document formats:

PDF
PNG
JPG
JPEG
TIFF
BMP
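
File discovery for the formats listed above can be sketched as follows. The function name and the "pdf" / "image" labels are assumptions; the idea is only that PDFs go to the parser first while image files go straight to OCR.

```python
from pathlib import Path

PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def discover_documents(root: Path) -> list[tuple[Path, str]]:
    """Return (path, kind) pairs for supported files under root, sorted by path."""
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # treat .PDF and .pdf the same way
        if suffix in PDF_EXTENSIONS:
            found.append((path, "pdf"))
        elif suffix in IMAGE_EXTENSIONS:
            found.append((path, "image"))
    return found
```

For example, discover_documents(Path("data/raw")) would walk every subfolder and silently skip unsupported files.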

Note

The original procurement documents are not included in this repository because they are part of the challenge dataset. Place the provided files under data/raw/ before running the ingestion pipeline.

Running the Pipeline

Build the local knowledge base by executing:

python run_pipeline.py

Example output:

{
    "documents_processed": 45,
    "chunks_indexed": 179
}

The ingestion pipeline performs the following tasks:

Reads procurement documents
Parses PDF files
Applies OCR when required
Extracts structured metadata
Generates retrieval-ready chunks
Creates semantic embeddings
Stores metadata in SQLite
Builds the ChromaDB vector index
Creates document relationships

The pipeline is idempotent: it can be executed multiple times without duplicating previously ingested records.
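
The README does not say how idempotency is achieved. One common approach, sketched here under that caveat, is to key each document on its file name, store a content hash, and use an SQLite upsert so a re-run rewrites a row only when the file has changed. The table and column names are hypothetical.

```python
import hashlib
import sqlite3


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def upsert_document(conn: sqlite3.Connection, file_name: str, data: bytes) -> bool:
    """Insert or update one document; return False when nothing changed."""
    digest = content_hash(data)
    row = conn.execute(
        "SELECT sha256 FROM documents WHERE file_name = ?", (file_name,)
    ).fetchone()
    if row and row[0] == digest:
        return False  # same content as last run, skip re-indexing
    conn.execute(
        "INSERT INTO documents (file_name, sha256) VALUES (?, ?) "
        "ON CONFLICT(file_name) DO UPDATE SET sha256 = excluded.sha256",
        (file_name, digest),
    )
    conn.commit()
    return True
```

The boolean return value lets the caller decide whether the document's chunks and embeddings need to be rebuilt.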

Running the MCP Server

Start the MCP server:

python -m app.server

The server exposes procurement retrieval capabilities through the Model Context Protocol (MCP).

Rather than exposing raw database queries, the MCP server provides business-oriented tools that allow AI assistants to retrieve grounded procurement evidence using natural language.

Verifying the Installation

After executing the ingestion pipeline, verify that the following artifacts have been created:

storage/
├── knowledge.db
└── chroma/

The SQLite database contains:

Documents
Extracted metadata
Chunk metadata
Document relationships

The Chroma directory contains the semantic vector index.

If both artifacts exist, the knowledge base has been successfully created.
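
The same check can be scripted. The helper below is a hypothetical convenience, not part of the repository; it only confirms that the database file is non-empty and that the Chroma directory contains at least one entry.

```python
from pathlib import Path


def check_artifacts(storage_dir: Path) -> dict[str, bool]:
    """Report whether each expected pipeline artifact is present and non-empty."""
    db = storage_dir / "knowledge.db"
    chroma = storage_dir / "chroma"
    return {
        "knowledge.db": db.is_file() and db.stat().st_size > 0,
        "chroma/": chroma.is_dir() and any(chroma.iterdir()),
    }
```

Running check_artifacts(Path("storage")) from the project root should report True for both entries after a successful ingestion run.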

Claude Desktop Integration

The MCP server can be consumed directly from Claude Desktop, enabling natural language interaction with the procurement knowledge base.

Configure Claude Desktop

Open the Claude Desktop configuration file.

macOS

~/Library/Application Support/Claude/claude_desktop_config.json

Add the following configuration:

{
  "mcpServers": {
    "caseware-ai-data-mcp": {
      "command": "/absolute/path/to/env/bin/python",
      "args": [
        "-m",
        "app.server"
      ],
      "cwd": "/absolute/path/to/caseware-ai-data-mcp",
      "env": {
        "PYTHONPATH": "/absolute/path/to/caseware-ai-data-mcp"
      }
    }
  }
}

Replace the placeholder paths with your local project paths.

Restart Claude Desktop after saving the configuration.
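
If you prefer not to edit the JSON by hand, the entry can be generated and merged with a short script. This is a hypothetical helper that mirrors the configuration shown above; it keeps any other servers or settings already present in the file.

```python
import json
from pathlib import Path


def build_server_entry(project_dir: Path, python_path: Path) -> dict:
    """Build the server entry with the same keys as the configuration above."""
    project = str(project_dir)
    return {
        "command": str(python_path),
        "args": ["-m", "app.server"],
        "cwd": project,
        "env": {"PYTHONPATH": project},
    }


def merge_into_config(config_path: Path, name: str, entry: dict) -> dict:
    """Add or replace one server entry, preserving everything else in the file."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

Back up the existing configuration file before running a script like this against it.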

Available MCP Tools

Tool	Description
`search_documents`	Semantic retrieval across indexed procurement documents
`hybrid_document_search`	Hybrid metadata + semantic retrieval
`get_documents_for_order`	Retrieves supporting procurement documents for an Order ID
`compare_documents_for_order`	Performs a lightweight procurement audit
`get_invoices_missing_purchase_orders`	Detects invoices without matching purchase orders
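
To give a sense of what a tool such as `get_invoices_missing_purchase_orders` might run internally, the sketch below uses a self-join on a hypothetical documents table. The real implementation in matching.py is not shown here and may work differently.

```python
import sqlite3

# Assumed schema: documents(id, file_name, doc_type, order_id).
MISSING_PO_SQL = """
SELECT inv.file_name, inv.order_id
FROM documents AS inv
LEFT JOIN documents AS po
       ON po.order_id = inv.order_id
      AND po.doc_type = 'purchase_order'
WHERE inv.doc_type = 'invoice'
  AND po.id IS NULL
ORDER BY inv.file_name
"""


def invoices_missing_purchase_orders(conn: sqlite3.Connection) -> list[tuple]:
    """Return (file_name, order_id) for invoices with no purchase order on the same order."""
    return conn.execute(MISSING_PO_SQL).fetchall()
```

With this query, an invoice whose order ID could not be extracted (NULL) is also reported, which is usually the desired behavior for an audit.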

Quick Validation

After connecting Claude Desktop, ask the following question:

Which documents support order 10248?

Expected response:

Invoice
Purchase Order
Shipping Order

This confirms that:

the ingestion pipeline executed successfully
SQLite contains the extracted metadata
ChromaDB contains the semantic index
the MCP server is running correctly
Claude Desktop can retrieve grounded procurement evidence

Retrieval Strategy

The platform implements a lightweight Hybrid Retrieval architecture that combines deterministic metadata lookup with semantic vector search.

Metadata Retrieval

During ingestion, structured procurement entities are extracted and stored in SQLite.

Examples include:

Order IDs
Invoice Numbers
Purchase Order Numbers
Vendor Names
Dates
Amounts

Queries containing explicit identifiers are resolved through deterministic lookups, providing fast and highly accurate results.
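
A minimal sketch of deterministic field extraction is shown below. The regular expressions and field names are assumptions made for illustration; the real layouts in the challenge dataset may need different patterns.

```python
import re

# Illustrative patterns only; real documents may label these fields differently.
PATTERNS = {
    "order_id": re.compile(r"\bOrder\s*(?:ID|No\.?|#)?\s*[:#]?\s*(\d{4,})", re.IGNORECASE),
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:No\.?|Number|#)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "total_amount": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_metadata(text: str) -> dict:
    """Return the fields found in text; fields with no match are omitted."""
    fields: dict = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    if "total_amount" in fields:
        # Strip thousands separators before converting to a number.
        fields["total_amount"] = float(fields["total_amount"].replace(",", ""))
    return fields
```

Given a made-up sample such as "Invoice Number: INV-10248", "Order ID: 10248", and "Total: $1,863.40" on separate lines, the function returns all three fields; text with no recognizable labels yields an empty dictionary.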

Semantic Retrieval

Natural language questions are answered using semantic similarity search.

Document chunks are embedded using SentenceTransformers and indexed in ChromaDB.

Typical semantic queries include:

Summarize payment terms.
Find supplier obligations.
What inventory reports mention warehouse damage?
Which contracts discuss delivery conditions?
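
Conceptually, semantic search ranks chunks by how close their embedding is to the query embedding. The toy below uses hand-written three-dimensional vectors and plain cosine similarity as a stand-in; in the project the vectors come from SentenceTransformers and the nearest-neighbor search is done by ChromaDB, whose configured distance metric is not stated in this README.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k chunk IDs most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

A brute-force scan like this is fine for a few hundred chunks; a vector database becomes useful once the index no longer fits a simple loop.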

Hybrid Retrieval

The retrieval layer automatically selects the most appropriate strategy based on the query.

For example:

Which documents support order 10248?

The system:

Detects the Order ID.
Retrieves matching procurement documents from SQLite.
Complements the response with semantic evidence when applicable.
Returns grounded citations.

This approach provides better precision than relying exclusively on vector search.
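
The routing described above can be sketched as a small function that takes the two retrieval strategies as arguments. The function name, the order-ID pattern, and the result shape are assumptions; hybrid.py may combine results differently.

```python
import re
from typing import Callable

ORDER_ID = re.compile(r"\border\s*#?\s*(\d{4,})\b", re.IGNORECASE)


def hybrid_search(
    query: str,
    lookup_by_order: Callable[[str], list[dict]],
    semantic_search: Callable[[str, int], list[dict]],
    k: int = 3,
) -> dict:
    """Prefer exact metadata hits, then append semantic hits for other files."""
    match = ORDER_ID.search(query)
    exact = lookup_by_order(match.group(1)) if match else []
    semantic = semantic_search(query, k)
    seen = {hit["file"] for hit in exact}
    merged = exact + [hit for hit in semantic if hit["file"] not in seen]
    strategy = "metadata+semantic" if match else "semantic"
    return {"strategy": strategy, "results": merged}
```

Placing deterministic hits first means an exact Order ID match is never outranked by a chunk that merely looks similar.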

Citation Strategy

Retrieval results include references to the original source document whenever possible.

Example:

{
  "file": "invoice_10248.pdf",
  "page": 1,
  "chunk": 0
}

This enables AI assistants to generate grounded and explainable responses rather than unsupported summaries.
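
One way to carry these references through the retrieval layer is a small value type, sketched below. The Citation class and helper names are hypothetical; only the three fields (file, page, chunk) come from the example above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    file: str
    page: int
    chunk: int

    def label(self) -> str:
        """Human-readable form for display next to an answer."""
        return f"{self.file}, p. {self.page}, chunk {self.chunk}"


def to_citation(hit: dict) -> Optional[Citation]:
    """Build a citation from a hit, or None when source fields are missing."""
    try:
        return Citation(hit["file"], int(hit["page"]), int(hit["chunk"]))
    except KeyError:
        return None


def attach_citations(hits: list[dict]) -> list[dict]:
    """Pair each hit's text with its citation dict (None if unavailable)."""
    results = []
    for hit in hits:
        citation = to_citation(hit)
        results.append({
            "text": hit.get("text", ""),
            "citation": asdict(citation) if citation is not None else None,
        })
    return results
```

Returning None instead of dropping the hit keeps the result visible while making the missing source explicit to the assistant.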

Example Questions

Once connected through Claude Desktop (or another MCP-compatible client), you can ask questions such as:

Which documents support order 10248?

Compare procurement documents for order 10248.

Which invoices are missing purchase orders?

Summarize the payment terms in the supplier contract.

Find evidence related to vendor Paul Henriot.

What inventory reports are available?

Design Decisions

The implementation intentionally favors simplicity over unnecessary complexity while demonstrating the architectural patterns expected from an AI-ready data platform.

Key design decisions include:

SQLite provides a lightweight metadata store requiring no external infrastructure.
ChromaDB enables local semantic retrieval without requiring managed vector databases.
SentenceTransformers generates embeddings locally without external AI services.
FastMCP exposes business-oriented capabilities through the Model Context Protocol.
Hybrid Retrieval combines deterministic matching with semantic similarity to improve retrieval accuracy.

These choices keep the project reproducible, easy to understand, and aligned with the challenge scope.

Engineering Trade-offs

This implementation intentionally prioritizes:

Simplicity over production-scale infrastructure.
Explainability over complex AI pipelines.
Local execution over cloud deployment.
Modular design over tightly coupled components.
Hybrid retrieval (deterministic metadata plus semantic search) over vector search alone.

The objective is to demonstrate sound AI Data Engineering principles rather than build a production-ready enterprise platform.

Future Improvements

Potential production enhancements include:

Schema-constrained LLM-based metadata extraction.
Confidence scoring for extracted fields.
BM25 + vector hybrid ranking.
Line-item reconciliation across procurement documents.
Human review workflows for low-confidence matches.
OpenSearch or AWS Bedrock Knowledge Bases for cloud deployment.
Observability with LangFuse, LangSmith, or OpenTelemetry.

AI-Assisted Development

This project was developed with AI assistance for architectural brainstorming, implementation scaffolding, documentation, and code refinement.

All generated code was manually reviewed, integrated, executed locally, and validated by:

Running the ingestion pipeline.
Verifying SQLite outputs.
Validating ChromaDB indexing.
Testing metadata extraction.
Executing semantic and hybrid retrieval.
Testing all MCP tools.
Validating end-to-end integration with Claude Desktop.

The final implementation, architecture, and engineering decisions were manually reviewed to ensure correctness, reproducibility, and alignment with the challenge requirements.

Why this Architecture?

The solution intentionally separates:

Ingestion
Knowledge Storage
Retrieval
MCP Interface

This modular architecture minimizes coupling and allows each layer to evolve independently.

For example:

SQLite can be replaced with PostgreSQL.
ChromaDB can be replaced with OpenSearch or another vector database.
The embedding model can be replaced without changing the retrieval layer.
OCR can be replaced without impacting downstream processing.

This design improves maintainability, extensibility, and testability while remaining intentionally lightweight for the scope of the exercise.
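
The replaceability claim can be illustrated with a structural interface: the retrieval code depends on two methods, not on a specific backend. The VectorStore protocol and the in-memory stand-in below are hypothetical and are not taken from the repository.

```python
from typing import Protocol


class VectorStore(Protocol):
    """The only surface the retrieval layer would need from a vector backend."""

    def add(self, chunk_id: str, vector: list[float]) -> None: ...

    def query(self, vector: list[float], k: int) -> list[str]: ...


class InMemoryStore:
    """Stand-in backend; a ChromaDB or OpenSearch adapter would expose the same methods."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._vectors[chunk_id] = vector

    def query(self, vector: list[float], k: int) -> list[str]:
        def score(item: tuple[str, list[float]]) -> float:
            return sum(a * b for a, b in zip(vector, item[1]))  # dot product

        ranked = sorted(self._vectors.items(), key=score, reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]


def retrieve(store: VectorStore, query_vector: list[float], k: int = 2) -> list[str]:
    """Retrieval code written against the protocol, not a concrete database."""
    return store.query(query_vector, k)
```

Swapping the backend then means writing one adapter class; retrieve and everything above it stay unchanged, and tests can run against the in-memory version.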

License

This project was developed exclusively for the Caseware AI Data Platform Take-Home Assessment.

Acknowledgements

This project was developed as part of the Caseware AI Data Platform technical assessment.

The goal was to demonstrate an AI-ready procurement knowledge platform using lightweight, explainable, and reproducible engineering practices.
