caseware-ai-procurement-knowledge-platform
Enables AI assistants to retrieve, search, and compare procurement documents using hybrid retrieval and MCP integration.
Caseware AI Procurement Knowledge Platform
AI-ready procurement knowledge platform built with a local data pipeline, hybrid retrieval, and the Model Context Protocol (MCP).
Overview
This project implements an end-to-end AI-ready data platform for procurement and inventory documents.
The solution demonstrates how structured and unstructured business documents can be ingested, transformed into searchable knowledge, and exposed through a Model Context Protocol (MCP) server. AI assistants can then retrieve evidence, reason across related documents, and generate grounded responses with source references.
The implementation intentionally remains lightweight and fully local while showcasing modern AI Data Engineering concepts, including:
- PDF and image ingestion
- OCR fallback for scanned documents
- Structured metadata extraction
- Semantic embeddings
- Hybrid retrieval
- Cross-document relationship matching
- MCP tool integration
- Grounded AI responses
The architecture prioritizes simplicity, explainability, and reproducibility, following the challenge recommendation to avoid over-engineering.
Key Capabilities
- PDF document ingestion
- OCR using Tesseract
- Native PDF parsing with PyMuPDF
- Structured metadata extraction
- SQLite metadata store
- ChromaDB vector database
- SentenceTransformers embeddings
- Hybrid retrieval (metadata + semantic search)
- Cross-document relationship matching
- Procurement document comparison
- MCP server integration
- Claude Desktop integration
- Grounded source citations
Design Goals
The solution was intentionally designed to demonstrate the core architectural components of an AI-ready data platform while keeping the implementation easy to understand and reproduce.
Primary goals include:
- Reproducible local execution
- AI-ready document preparation
- Explainable retrieval
- Hybrid search combining deterministic metadata and semantic similarity
- Modular architecture with clear separation of concerns
- Agent integration through MCP
Rather than focusing on production-scale infrastructure, the implementation emphasizes engineering decisions, maintainability, and retrieval quality.
High-Level Architecture
flowchart TD
A[Raw Procurement Documents]
A --> B[PDF Parser]
A --> C[OCR - Tesseract]
B --> D[Extracted Text]
C --> D
D --> E[Chunking]
E --> F[Metadata Extraction]
E --> G[SentenceTransformers Embeddings]
F --> H[(SQLite)]
G --> I[(ChromaDB)]
H --> J[Hybrid Retrieval Layer]
I --> J
J --> K[FastMCP Server]
K --> L[Claude Desktop]
Data Flow
The ingestion pipeline performs the following steps:
- Load procurement documents from the local filesystem.
- Parse native PDFs using PyMuPDF.
- Apply OCR to scanned documents using Tesseract.
- Normalize extracted text.
- Extract structured procurement metadata.
- Split documents into retrieval-ready chunks.
- Generate semantic embeddings.
- Store structured metadata in SQLite.
- Store semantic vectors in ChromaDB.
- Expose retrieval capabilities through an MCP server.
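The chunking step can be illustrated with a minimal sketch. It assumes overlapping word windows; the function name `chunk_text` and the window sizes are hypothetical and are not taken from `chunk.py`.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative assumptions)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of some duplicated text in the index.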
Technology Stack
| Layer | Technology |
|---|---|
| Language | Python |
| PDF Parsing | PyMuPDF |
| OCR | Tesseract |
| Embeddings | SentenceTransformers |
| Metadata Store | SQLite |
| Vector Database | ChromaDB |
| MCP Framework | FastMCP |
| AI Client | Claude Desktop |
Project Structure
The project is organized into independent modules following a clear separation of concerns. Each component has a single responsibility, making the solution easier to understand, maintain, and extend.
caseware-ai-data-mcp/
│
├── app/
│ ├── pipeline/
│ │ ├── extract.py # PDF parsing and OCR
│ │ ├── chunk.py # Document chunking
│ │ ├── model.py # Metadata extraction
│ │ ├── ingest.py # End-to-end ingestion pipeline
│ │ └── index.py # ChromaDB indexing
│ │
│ ├── retrieval/
│ │ ├── search.py # Semantic retrieval
│ │ ├── matching.py # Cross-document matching
│ │ ├── hybrid.py # Hybrid retrieval
│ │ └── citations.py # Source references
│ │
│ ├── db.py # SQLite initialization
│ └── server.py # FastMCP server
│
├── data/
│ └── raw/ # Procurement documents
│
├── storage/
│ ├── knowledge.db # SQLite metadata store
│ └── chroma/ # ChromaDB vector index
│
├── run_pipeline.py
├── requirements.txt
└── README.md
Module Responsibilities
Pipeline
The pipeline transforms raw procurement documents into AI-ready knowledge.
Responsibilities include:
- Reading procurement documents
- Parsing PDF files
- Running OCR when required
- Extracting structured metadata
- Chunking document content
- Generating semantic embeddings
- Populating SQLite
- Building the ChromaDB vector index
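One way to decide when OCR is "required" is to try native parsing first and fall back when it yields too little text. The sketch below assumes that heuristic; `extract_text`, the `min_chars` threshold, and the two injected callables are hypothetical stand-ins for the PyMuPDF and Tesseract calls in `extract.py`.

```python
from typing import Callable


def extract_text(
    path: str,
    parse_native: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 50,  # assumed threshold, not taken from the project
) -> tuple[str, str]:
    """Return (text, method), using OCR only when native parsing yields too little text."""
    native = parse_native(path).strip()
    if len(native) >= min_chars:
        return native, "native"
    # Scanned PDFs and images usually produce little or no native text.
    return run_ocr(path).strip(), "ocr"
```

Passing the parser and OCR engine as callables keeps the fallback logic testable without installing either dependency.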
Retrieval
The retrieval layer is responsible for answering user questions.
It combines two complementary strategies:
- Deterministic metadata lookup
- Semantic vector search
This hybrid approach improves retrieval precision while maintaining flexibility for natural language queries.
Storage
Structured and semantic information are intentionally stored separately.
SQLite
Stores:
- Document metadata
- Extracted procurement fields
- Chunk metadata
- Document relationships
ChromaDB
Stores:
- Sentence embeddings
- Semantic vector index
Separating these responsibilities keeps the architecture simple while allowing each technology to focus on its strengths.
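A minimal sketch of what the SQLite side might look like is shown here. The table and column names are assumptions for illustration; the actual schema in `db.py` is not described in this README.

```python
import sqlite3

# Hypothetical schema: one row per document, one row per chunk.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL UNIQUE,
    doc_type TEXT,
    order_id TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    page INTEGER,
    chunk_index INTEGER,
    text TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Documents that share an `order_id` can then be related with a plain SQL join, while the embeddings themselves stay in ChromaDB.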
MCP Server
The FastMCP server exposes business-oriented retrieval capabilities rather than direct database access.
Available operations include:
- Search procurement documents
- Retrieve supporting documents for an order
- Compare procurement documents
- Detect missing purchase orders
- Execute hybrid retrieval
This abstraction allows AI assistants to interact with procurement knowledge through natural language instead of SQL queries.
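To illustrate what a business-oriented operation might do behind the tool interface, the sketch below compares the documents found for one order. The function name, the input shape, and the checks are assumptions; the expected document types are the three listed under Quick Validation.

```python
EXPECTED_TYPES = {"invoice", "purchase_order", "shipping_order"}


def compare_documents(docs: list[dict]) -> dict:
    """Report missing document types and whether the extracted totals agree.

    Each item is assumed to look like {"doc_type": str, "total": float | None}.
    """
    present = {d["doc_type"] for d in docs}
    totals = {d["doc_type"]: d["total"] for d in docs if d.get("total") is not None}
    return {
        "missing_types": sorted(EXPECTED_TYPES - present),
        "totals": totals,
        # Zero or one distinct value means nothing contradicts anything else.
        "totals_match": len(set(totals.values())) <= 1,
    }
```

Returning a small structured report, rather than raw rows, gives the assistant something it can summarize directly and cite.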
Installation
Prerequisites
Before running the project, install the following software:
| Dependency | Version |
|---|---|
| Python | 3.11+ |
| Git | Latest |
| Tesseract OCR | Latest |
| Claude Desktop (optional) | Latest |
Clone the Repository
git clone <repository-url>
cd caseware-ai-data-mcp
Create a Virtual Environment
macOS / Linux
python -m venv env
source env/bin/activate
Windows
python -m venv env
env\Scripts\activate
Install Python Dependencies
pip install -r requirements.txt
Install OCR
# macOS
brew install tesseract
# Ubuntu
sudo apt install tesseract-ocr
# Windows
Download and install Tesseract from:
https://github.com/UB-Mannheim/tesseract/wiki
Verify the installation:
tesseract --version
Preparing the Dataset
Place the procurement documents inside the data/raw/ directory.
data/
└── raw/
├── contracts/
├── invoices/
├── purchase_orders/
├── shipping_orders/
└── inventory_reports/
Supported document formats:
- PDF
- PNG
- JPG
- JPEG
- TIFF
- BMP
Note
The original procurement documents are not included in this repository because they are part of the challenge dataset. Place the provided files under data/raw/ before running the ingestion pipeline.
Running the Pipeline
Build the local knowledge base by executing:
python run_pipeline.py
Example output:
{
"documents_processed": 45,
"chunks_indexed": 179
}
The ingestion pipeline performs the following tasks:
- Reads procurement documents
- Parses PDF files
- Applies OCR when required
- Extracts structured metadata
- Generates retrieval-ready chunks
- Creates semantic embeddings
- Stores metadata in SQLite
- Builds the ChromaDB vector index
- Creates document relationships
The pipeline is idempotent and may be executed multiple times.
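The README does not say how idempotence is achieved. One common approach, sketched here as an assumption, is to key each document on its file path and upsert instead of insert. The table layout and `upsert_document` are hypothetical, and the `ON CONFLICT` clause needs SQLite 3.24 or newer.

```python
import sqlite3


def upsert_document(conn: sqlite3.Connection, file: str, doc_type: str, order_id: str) -> None:
    """Insert the document, or refresh its fields if the file was ingested before."""
    conn.execute(
        """
        INSERT INTO documents (file, doc_type, order_id)
        VALUES (?, ?, ?)
        ON CONFLICT(file) DO UPDATE SET
            doc_type = excluded.doc_type,
            order_id = excluded.order_id
        """,
        (file, doc_type, order_id),
    )
    conn.commit()


# Usage: running the same ingestion twice leaves a single row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (file TEXT PRIMARY KEY, doc_type TEXT, order_id TEXT)")
for _ in range(2):
    upsert_document(conn, "invoice_10248.pdf", "invoice", "10248")
```

The vector index needs the same treatment, for example by deriving chunk IDs from the file name and chunk position so that a re-run overwrites rather than duplicates.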
Running the MCP Server
Start the MCP server:
python -m app.server
The server exposes procurement retrieval capabilities through the Model Context Protocol (MCP).
Rather than exposing raw database queries, the MCP server provides business-oriented tools that allow AI assistants to retrieve grounded procurement evidence using natural language.
Verifying the Installation
After executing the ingestion pipeline, verify that the following artifacts have been created:
storage/
├── knowledge.db
└── chroma/
The SQLite database contains:
- Documents
- Extracted metadata
- Chunk metadata
- Document relationships
The Chroma directory contains the semantic vector index.
If both artifacts exist, the knowledge base has been successfully created.
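The check can also be scripted. This is a minimal sketch that only tests for a non-empty database file and a non-empty Chroma directory; `knowledge_base_ready` is a hypothetical helper and does not inspect the contents.

```python
from pathlib import Path


def knowledge_base_ready(storage_dir: str = "storage") -> bool:
    """Return True when both storage artifacts exist and are non-empty."""
    root = Path(storage_dir)
    db = root / "knowledge.db"
    chroma = root / "chroma"
    return (
        db.is_file()
        and db.stat().st_size > 0
        and chroma.is_dir()
        and any(chroma.iterdir())
    )


if __name__ == "__main__":
    print("ready" if knowledge_base_ready() else "missing artifacts")
```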
Claude Desktop Integration
The MCP server can be consumed directly from Claude Desktop, enabling natural language interaction with the procurement knowledge base.
Configure Claude Desktop
Open the Claude Desktop configuration file.
macOS
~/Library/Application Support/Claude/claude_desktop_config.json
Add the following configuration:
{
"mcpServers": {
"caseware-ai-data-mcp": {
"command": "/absolute/path/to/env/bin/python",
"args": [
"-m",
"app.server"
],
"cwd": "/absolute/path/to/caseware-ai-data-mcp",
"env": {
"PYTHONPATH": "/absolute/path/to/caseware-ai-data-mcp"
}
}
}
}
Replace the placeholder paths with your local project paths.
Restart Claude Desktop after saving the configuration.
Available MCP Tools
| Tool | Description |
|---|---|
| search_documents | Semantic retrieval across indexed procurement documents |
| hybrid_document_search | Hybrid metadata + semantic retrieval |
| get_documents_for_order | Retrieves supporting procurement documents for an Order ID |
| compare_documents_for_order | Performs a lightweight procurement audit |
| get_invoices_missing_purchase_orders | Detects invoices without matching purchase orders |
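As an illustration of the kind of query that could sit behind get_invoices_missing_purchase_orders, the sketch below uses an anti-join on a shared order ID. The `documents` table, its columns, and the `doc_type` values are assumptions, not the project's actual schema.

```python
import sqlite3


def invoices_missing_purchase_orders(conn: sqlite3.Connection) -> list[str]:
    """Return invoice files whose order_id has no purchase order on record."""
    rows = conn.execute(
        """
        SELECT inv.file
        FROM documents AS inv
        WHERE inv.doc_type = 'invoice'
          AND NOT EXISTS (
              SELECT 1
              FROM documents AS po
              WHERE po.doc_type = 'purchase_order'
                AND po.order_id = inv.order_id
          )
        ORDER BY inv.file
        """
    ).fetchall()
    return [row[0] for row in rows]
```

An invoice whose order ID could not be extracted (NULL) is also reported, which is usually the safer behavior for an audit.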
Quick Validation
After connecting Claude Desktop, ask the following question:
Which documents support order 10248?
Expected response:
- Invoice
- Purchase Order
- Shipping Order
This confirms that:
- the ingestion pipeline executed successfully
- SQLite contains the extracted metadata
- ChromaDB contains the semantic index
- the MCP server is running correctly
- Claude Desktop can retrieve grounded procurement evidence
Retrieval Strategy
The platform implements a lightweight Hybrid Retrieval architecture that combines deterministic metadata lookup with semantic vector search.
Metadata Retrieval
During ingestion, structured procurement entities are extracted and stored in SQLite.
Examples include:
- Order IDs
- Invoice Numbers
- Purchase Order Numbers
- Vendor Names
- Dates
- Amounts
Queries containing explicit identifiers are resolved through deterministic lookups, providing fast and highly accurate results.
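Deterministic extraction of this kind is often done with regular expressions. The patterns below are a sketch under assumed document wording ("Order ID: 10248", "Total: $1,234.50"); they are not the rules used in `model.py`.

```python
import re

# Assumed label formats; real documents may need additional patterns.
ORDER_ID_RE = re.compile(
    r"\border\s*(?:id|no\.?|number)?\s*[:#]?\s*(\d{4,})\b", re.IGNORECASE
)
TOTAL_RE = re.compile(
    r"\btotal\s*(?:amount|price)?\s*[:=]?\s*\$?\s*([\d,]+\.\d{2})\b", re.IGNORECASE
)


def extract_metadata(text: str) -> dict:
    """Pull an order ID and a total amount out of extracted text, if present."""
    order = ORDER_ID_RE.search(text)
    total = TOTAL_RE.search(text)
    return {
        "order_id": order.group(1) if order else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
    }
```

Fields that fail to match stay `None` rather than being guessed, so downstream lookups never rely on invented identifiers.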
Semantic Retrieval
Natural language questions are answered using semantic similarity search.
Document chunks are embedded using SentenceTransformers and indexed in ChromaDB.
Typical semantic queries include:
- Summarize payment terms.
- Find supplier obligations.
- What inventory reports mention warehouse damage?
- Which contracts discuss delivery conditions?
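The ranking idea behind semantic search can be shown with toy vectors. This sketch uses cosine similarity in plain Python; in the project the vectors come from SentenceTransformers and the search runs inside ChromaDB, and the distance metric configured there is not stated in this README.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Return the k chunk IDs most similar to the query, best first."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()),
        reverse=True,
    )
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```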
Hybrid Retrieval
The retrieval layer automatically selects the most appropriate strategy based on the query.
For example:
Which documents support order 10248?
The system:
- Detects the Order ID.
- Retrieves matching procurement documents from SQLite.
- Complements the response with semantic evidence when applicable.
- Returns grounded citations.
This approach provides better precision than relying exclusively on vector search.
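The routing described above can be sketched as follows. It assumes an order ID is detected with a regular expression and that exact matches are listed before semantic ones; `hybrid_search` and the two injected lookup functions are hypothetical, not the interface of `hybrid.py`.

```python
import re
from typing import Callable

ORDER_ID_RE = re.compile(
    r"\border\s*(?:id|no\.?|number)?\s*[:#]?\s*(\d{4,})\b", re.IGNORECASE
)


def hybrid_search(
    query: str,
    lookup_by_order: Callable[[str], list[str]],
    semantic_search: Callable[[str], list[str]],
    k: int = 5,
) -> dict:
    """Combine exact order-ID matches with semantic matches, exact ones first."""
    match = ORDER_ID_RE.search(query)
    exact = lookup_by_order(match.group(1)) if match else []
    merged = list(exact)
    seen = set(exact)
    for file in semantic_search(query):
        if file not in seen:  # avoid listing the same document twice
            seen.add(file)
            merged.append(file)
    return {"strategy": "hybrid" if exact else "semantic", "results": merged[:k]}
```

Putting deterministic matches first means an explicit identifier in the question is never outranked by a merely similar chunk.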
Citation Strategy
Every retrieval result includes references to the original source document whenever possible.
Example:
{
"file": "invoice_10248.pdf",
"page": 1,
"chunk": 0
}
This enables AI assistants to generate grounded and explainable responses rather than unsupported summaries.
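A citation with these three fields can be modeled as a small value object. The sketch below assumes a dataclass; `Citation` and `attach_citation` are hypothetical names and do not describe `citations.py`.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Citation:
    file: str
    page: int
    chunk: int

    def label(self) -> str:
        """Human-readable form for inline references."""
        return f"{self.file}, page {self.page}, chunk {self.chunk}"


def attach_citation(text: str, citation: Citation) -> dict:
    """Bundle a retrieved passage with the source it came from."""
    return {"text": text, "source": asdict(citation)}
```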
Example Questions
Once connected through Claude Desktop (or another MCP-compatible client), the following questions can be asked:
Which documents support order 10248?
Compare procurement documents for order 10248.
Which invoices are missing purchase orders?
Summarize the payment terms in the supplier contract.
Find evidence related to vendor Paul Henriot.
What inventory reports are available?
Design Decisions
The implementation intentionally favors simplicity over unnecessary complexity while demonstrating the architectural patterns expected from an AI-ready data platform.
Key design decisions include:
- SQLite provides a lightweight metadata store requiring no external infrastructure.
- ChromaDB enables local semantic retrieval without requiring managed vector databases.
- SentenceTransformers generates embeddings locally without external AI services.
- FastMCP exposes business-oriented capabilities through the Model Context Protocol.
- Hybrid Retrieval combines deterministic matching with semantic similarity to improve retrieval accuracy.
These choices keep the project reproducible, easy to understand, and aligned with the challenge scope.
Engineering Trade-offs
This implementation intentionally prioritizes:
- Simplicity over production-scale infrastructure.
- Explainability over complex AI pipelines.
- Local execution over cloud deployment.
- Modular design over tightly coupled components.
- Deterministic metadata extraction combined with semantic retrieval.
The objective is to demonstrate sound AI Data Engineering principles rather than build a production-ready enterprise platform.
Future Improvements
Potential production enhancements include:
- Schema-constrained LLM-based metadata extraction.
- Confidence scoring for extracted fields.
- BM25 + Vector hybrid ranking.
- Line-item reconciliation across procurement documents.
- Human review workflows for low-confidence matches.
- OpenSearch or AWS Bedrock Knowledge Bases for cloud deployment.
- Observability with LangFuse, LangSmith, or OpenTelemetry.
AI-Assisted Development
This project was developed with AI assistance for architectural brainstorming, implementation scaffolding, documentation, and code refinement.
All generated code was manually reviewed, integrated, executed locally, and validated by:
- Running the ingestion pipeline.
- Verifying SQLite outputs.
- Validating ChromaDB indexing.
- Testing metadata extraction.
- Executing semantic and hybrid retrieval.
- Testing all MCP tools.
- Validating end-to-end integration with Claude Desktop.
The final implementation, architecture, and engineering decisions were manually reviewed to ensure correctness, reproducibility, and alignment with the challenge requirements.
Why this Architecture?
The solution intentionally separates:
- Ingestion
- Knowledge Storage
- Retrieval
- MCP Interface
This modular architecture minimizes coupling and allows each layer to evolve independently.
For example:
- SQLite can be replaced with PostgreSQL.
- ChromaDB can be replaced with OpenSearch or another vector database.
- The embedding model can be replaced without changing the retrieval layer.
- OCR can be replaced without impacting downstream processing.
This design improves maintainability, extensibility, and testability while remaining intentionally lightweight for the scope of the exercise.
License
This project was developed exclusively for the Caseware AI Data Platform Take-Home Assessment.
Acknowledgements
This project was developed as part of the Caseware AI Data Platform technical assessment.
The goal was to demonstrate an AI-ready procurement knowledge platform using lightweight, explainable, and reproducible engineering practices.