DocPulse

DocPulse enables users to distill large documentation into compact, LLM-ready summaries by crawling web, PDF, and local sources, augmented with community insights from Reddit/StackOverflow, all running locally on Apple Silicon.


🩺 DocPulse

DocPulse is a local-first MCP (Model Context Protocol) Server that transforms massive documentation into high-density, LLM-ready "Implementation Manifestos."

Built for Apple Silicon, it leverages native MLX for local inference, combined with intelligent web crawling and community context search (Reddit/StackOverflow) to provide a 360-degree view of any library, standard, or regulation.

License: MIT Code of Conduct


🎯 Why DocPulse?

  • Token Efficiency: Scraping and distilling docs locally saves a massive amount of tokens. Instead of sending thousands of raw HTML lines to a cloud LLM, you send only a dense, 2-3 page distilled manifesto.
  • Internal Network Support: Designed to work seamlessly within organizational networks. If your documentation is hosted on an internal wiki or API that doesn't require MFA/Auth, DocPulse can ingest it and provide context without exposing sensitive raw data to external scraping services.

🚀 Key Features

  • Multi-Source Ingestion:
    • Web: Intelligent crawling that bypasses JS-heavy UI noise.
    • PDF: Deep parsing of regulatory or technical PDF documents.
    • Local Files: Ingest single .md, .txt, .py, etc.
    • Directories: Recursive scanning of entire folders (local or mounted remote drives like OneDrive).
  • Local LLM Distillation: Uses mlx-lm with DeepSeek models to extract API signatures, version constraints, and logical edge cases.
  • Community Augmentation: Automatically fetches recent community discussions to identify undocumented bugs or workarounds.
  • Fixed-Resource Optimal Sizing: By default, strictly utilizes a highly optimized 7B distillation model to maximize extraction speed and save RAM for other coding agents without losing extraction accuracy.
  • Human-in-the-Loop Feedback: Save human corrections that are injected into future distillation runs for the same subject.
  • File-System Caching: Fast retrieval of previously synthesized context.

🧠 Intelligent Defaults

DocPulse is designed for a seamless, zero-config startup experience.

  • Dynamic Model Selection: On launch, DocPulse detects your system's total RAM and automatically selects the most capable model from our curated DeepSeek-R1 Distill suite:
    • < 24GB RAM: 7B (High-speed, minimal overhead).
    • 24GB - 64GB RAM: 14B (Deep extraction & reasoning).
    • > 64GB RAM: 32B (Maximum fidelity for complex standards).
  • Auto-Bootstrapping: The system automatically initializes your local config at ~/.config/docpulse/, creates the required cache directories, and downloads MLX model weights on demand.
  • Environment Configuration: The repository includes a .env.example template. Copy it with cp .env.example .env, then set optional search API keys (Brave/Google) or force a specific model size with the DOCPULSE_MODEL override.

🛠️ Requirements

  • Hardware: Apple Silicon (M1, M2, M3, M4).
  • Software: Python 3.10+, uv recommended.
  • Environment: macOS (optimized for unified memory).

📦 Installation

  1. Clone the repository:

    git clone https://github.com/your-username/docpulse.git
    cd docpulse
    
  2. Setup with uv:

    uv sync
    
  3. Install crawl4ai dependencies:

    uv run crawl4ai-setup
    
  4. Configure environment:

    cp .env.example .env
    # Edit .env with your preference/keys
    
  5. Run the CLI:

    uv run docpulse get fastapi --source="https://fastapi.tiangolo.com/tutorial/"
    

⚡ Quick Start (Claude Desktop)

  1. Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone & Sync:
    git clone https://github.com/your-username/docpulse.git && cd docpulse && uv sync
    
  3. Configure: cp .env.example .env (Add search keys if desired)
  4. Add to Claude: Add the configuration from the MCP Integration section to your claude_desktop_config.json.
  5. Start Coding: Ask Claude: "Analyze the FastAPI documentation for memory management patterns."

⚙️ Configuration

DocPulse features a robust, tiered configuration system.

1. Configuration Tiers

Settings are loaded in the following priority order (highest first):

  1. User Config: ~/.config/docpulse/config.toml
  2. Default Config: config.default.toml (bundled with the repo)

2. Configurable Settings

You can override any of these keys in your config.toml:

[app]

  • name: The name of the MCP server.
  • log_level: Set to DEBUG, INFO, WARNING, or ERROR.

[harvester]

  • text_extensions: List of file extensions to include in recursive scans.
  • encoding: Default encoding for local files (default: utf-8).
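
To show how these two settings could drive a recursive scan, here is a minimal sketch using only pathlib. The function name and the choice to replace undecodable bytes are assumptions for illustration, not the harvester's actual behavior.

```python
from pathlib import Path


def read_text_files(root: str, text_extensions: list[str], encoding: str = "utf-8") -> dict[str, str]:
    """Recursively collect files under `root` whose suffix is in `text_extensions`."""
    wanted = {ext.lower() for ext in text_extensions}
    contents: dict[str, str] = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            # errors="replace" keeps the scan going if a file contains stray bytes
            contents[str(path.relative_to(root))] = path.read_text(
                encoding=encoding, errors="replace"
            )
    return contents
```

Calling read_text_files("docs", [".md", ".txt", ".py"]) would return a mapping from relative path to file text for every matching file below docs.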

[distiller]

  • max_tokens: Maximum output length for the manifesto.
  • temperature: LLM sampling temperature.

[prompts]

Prompts are no longer stored in the TOML configuration. Instead, they live in the prompts/ directory.

  • Priority: ~/.config/docpulse/prompts/{name}.txt > prompts/{name}.txt.
  • Placeholder tokens: {raw_text}, {community_context}, {human_feedback}.
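
The lookup order and placeholder substitution can be sketched as follows. The helper names are hypothetical, and the single-pass regex substitution is one possible approach rather than DocPulse's confirmed method; it is chosen so that braces appearing inside the harvested text are never re-expanded.

```python
import re
from pathlib import Path

USER_PROMPTS = Path.home() / ".config" / "docpulse" / "prompts"
REPO_PROMPTS = Path("prompts")

_TOKEN = re.compile(r"\{(raw_text|community_context|human_feedback)\}")


def resolve_prompt(name: str, search_dirs=(USER_PROMPTS, REPO_PROMPTS)) -> Path:
    """Return the first existing {name}.txt, checking the user directory first."""
    for directory in search_dirs:
        candidate = Path(directory) / f"{name}.txt"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no prompt named {name!r}")


def fill_prompt(template: str, raw_text: str, community_context: str = "",
                human_feedback: str = "") -> str:
    """Substitute the three placeholder tokens in one pass; other braces stay untouched."""
    values = {
        "raw_text": raw_text,
        "community_context": community_context,
        "human_feedback": human_feedback,
    }
    return _TOKEN.sub(lambda match: values[match.group(1)], template)
```

A user who drops a file with the same name into ~/.config/docpulse/prompts/ would, under this sketch, shadow the bundled prompt without editing the repository.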

[models.entries]

Maps model repository strings to minimum RAM requirements (GB).

3. Environment Variables (.env)

Used for sensitive keys and quick overrides:

  • DOCPULSE_MODEL: Overrides the model size. The pipeline defaults to a 7B model because extraction relies mostly on deleting and reformatting text rather than novel synthesis, and a 32B model can exhaust unified memory and cause severe slowdowns. To override the default, set this variable to 14B, 32B, or a full Hugging Face repo link.
  • DOCPULSE_CACHE_DIR: Set the directory where distilled documentation is saved (defaults to .docpulse_cache in the current working directory).
  • BRAVE_API_KEY: For Brave Search augmentation.
  • GOOGLE_API_KEY & GOOGLE_CSE_ID: For Google Search augmentation.
  • Note: DuckDuckGo is the default and requires no key.
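
One way to picture how these keys could select a search backend is shown below. The function name and the precedence between Brave and Google when both are configured are assumptions; the text above only establishes that DuckDuckGo is used when no key is set.

```python
import os
from collections.abc import Mapping


def pick_search_provider(env: Mapping[str, str] = os.environ) -> str:
    """Choose a search backend from whichever keys are present."""
    if env.get("BRAVE_API_KEY"):
        return "brave"  # assumption: Brave is preferred when both are set
    if env.get("GOOGLE_API_KEY") and env.get("GOOGLE_CSE_ID"):
        return "google"  # Google needs both the API key and the CSE ID
    return "duckduckgo"  # documented default, no key required
```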

🧩 MCP Integration

Add to Claude Desktop

Add the following to your claude_desktop_config.json:

{
  "mcpServers": {
    "docpulse": {
      "command": "uv",
      "args": ["--directory", "/path/to/docpulse", "run", "python", "server.py"]
    }
  }
}
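
If you prefer to script this step, the entry can be merged into an existing config without disturbing other servers. The helper below is a hypothetical convenience, not part of DocPulse. On macOS the file typically lives at ~/Library/Application Support/Claude/claude_desktop_config.json, but check your installation before writing to it.

```python
import json
from pathlib import Path


def add_docpulse_server(config_path: Path, repo_dir: str) -> dict:
    """Insert or update the docpulse entry, keeping any other servers intact."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["docpulse"] = {
        "command": "uv",
        "args": ["--directory", repo_dir, "run", "python", "server.py"],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Restart Claude Desktop after changing the file so the new server is picked up.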

🧰 Tools Exposed

get_universal_context

Primary tool for creating or retrieving documentation context.

  • Arguments:
    • subject: Name (e.g., fastapi).
    • version: Version string (e.g., v0.115).
    • source: URL, absolute file path, or absolute directory path.
    • topic_keywords: (Optional) Keywords for community search.

report_context_failure

Allows developers to correct the server's output.

  • Arguments:
    • subject: The subject being corrected.
    • feedback: Detailed workaround or bug fix.
  • DocPulse will inject this feedback into the prompt the next time you request the same subject.
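
For orientation, this is roughly what an MCP client sends when it invokes these tools. The tools/call method and the name/arguments structure come from the Model Context Protocol; the helper function and the argument values are illustrative. The optional topic_keywords argument is omitted because its expected type is not stated here.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP clients."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


get_request = tool_call("get_universal_context", {
    "subject": "fastapi",
    "version": "v0.115",
    "source": "https://fastapi.tiangolo.com/tutorial/",
})

report_request = tool_call("report_context_failure", {
    "subject": "fastapi",
    "feedback": "The async client has changed in version 0.115",
})

print(get_request)
print(report_request)
```

In practice the MCP client builds these messages for you; the sketch is only meant to show how the documented arguments map onto a request.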


🧪 Testing

DocPulse provides several ways to test the server without connecting an MCP client such as Claude Desktop:

1. Dedicated CLI (On-Demand)

Run DocPulse directly from your terminal for one-off distillations:

# Get context for a subject
uv run docpulse get fastapi --source "https://fastapi.tiangolo.com/tutorial/"

# Report a failure or add feedback
uv run docpulse report fastapi "The async client has changed in version 0.115"

2. Automated E2E Script

Run the provided E2E test suite which verifies CLI execution and cache persistence:

./scripts/test_e2e.sh

3. Visual Debugging (MCP Inspector)

Open the interactive MCP Inspector to test tools via a web UI:

uv run fastmcp dev server.py

4. Unit & Multi-Layer Tests

Run the standard pytest suite:

uv run pytest tests/ -v

💾 Persistence & Survival

DocPulse is designed to survive reboots and server restarts without losing data.

  • File-System Cache: All distilled context is saved to a local cache directory. By default, this is .docpulse_cache/ in the current working directory; set the DOCPULSE_CACHE_DIR environment variable to use a different location.
  • Automatic Directory Management: The application will automatically ensure the caching directory exists before saving files to it, keeping things zero-configuration.
  • Cache-First Logic: Before performing a new harvest or distillation, the server checks the caching directory. If a match is found, it returns the stored result instantly.
  • Feedback Loop: Human feedback and failure reports are persisted in the feedback/ subdirectory of the cache and are automatically injected into future distillation prompts for that subject.
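
The cache-first flow and feedback injection described above can be sketched in a few lines. The file naming (subject-version.md, feedback/subject.txt) and the function names are hypothetical; only the default directory, the DOCPULSE_CACHE_DIR override, and the feedback/ subdirectory come from the description.

```python
import os
from pathlib import Path
from typing import Callable


def cache_dir() -> Path:
    """Cache location: DOCPULSE_CACHE_DIR if set, else .docpulse_cache in the cwd."""
    return Path(os.environ.get("DOCPULSE_CACHE_DIR", ".docpulse_cache"))


def load_feedback(subject: str) -> str:
    """Read stored human feedback for a subject (hypothetical file layout)."""
    path = cache_dir() / "feedback" / f"{subject}.txt"
    return path.read_text(encoding="utf-8") if path.is_file() else ""


def get_context(subject: str, version: str, distill: Callable[[str], str]) -> str:
    """Return the cached manifesto if present; otherwise distill and store it."""
    entry = cache_dir() / f"{subject}-{version}.md"  # hypothetical cache key
    if entry.is_file():
        return entry.read_text(encoding="utf-8")
    result = distill(load_feedback(subject))  # feedback is passed to the distiller
    entry.parent.mkdir(parents=True, exist_ok=True)  # create the directory on demand
    entry.write_text(result, encoding="utf-8")
    return result
```

Here distill stands in for the expensive harvest-and-distill step; it runs only on a cache miss, so repeated requests for the same subject and version return the stored file.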

🤝 Community & Governance

🌟 Pull Requests We'd Love to See

  • Platform Agnosticism: Currently, DocPulse is optimized for Apple Silicon via MLX. We invite PRs to support other backends (llama.cpp, ONNX, etc.) to make the system truly universal.
  • Integration Plugins: Right now, DocPulse works best with direct API/Web access. We welcome PRs for plugins that integrate with specific documentation platforms (Confluence, Notion, SharePoint, etc.) where documentation often lives.

📜 License

MIT
