DocShark

DocShark is an MCP server that scrapes and indexes documentation websites, enabling AI assistants to perform full-text searches on a local knowledge base built from public docs.

🦈 DocShark

DocShark is an MCP (Model Context Protocol) server that scrapes, indexes, and searches documentation websites. It builds a local, searchable knowledge base from public documentation pages using SQLite FTS5 full-text search with BM25 ranking, so AI assistants can query up-to-date docs.


🚀 Features

  • Automated Crawling: Discovers pages via sitemap.xml with fallback to BFS link crawling.
  • Smart Extraction: Uses Readability and Turndown to extract main content and convert it to clean Markdown, filtering out navbars and sidebars.
  • Semantic Chunking: Splits content based on headings, preserving contextual headers for better AI understanding.
  • High-Performance Search: Built-in SQLite + FTS5 indexing with BM25 ranking, so keyword queries return relevance-ordered results from a local database.
  • JS-Rendered Site Support: Tiered fetching strategy automatically detects React/Vue SPAs (empty shells) and upgrades to puppeteer-core if you have it installed (zero-config, auto-fallback).
  • Polite Crawling: Respects robots.txt and implements rate limiting to prevent overloading documentation servers.
  • Standard MCP Tooling: Connects to Claude Desktop, VS Code, Cursor, and any other MCP-compatible client via the standard stdio or HTTP/SSE transports.
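
To make the BM25 ranking mentioned above more concrete, here is a minimal sketch of the scoring formula in TypeScript. It is not DocShark's code: DocShark relies on SQLite FTS5 for ranking, and the `Doc` type, the `bm25Scores` function, and the `k1`/`b` values (commonly used defaults) are assumptions made for illustration.

```typescript
// Illustrative BM25 scorer over pre-tokenized documents.
// Hypothetical helper; DocShark itself delegates ranking to SQLite FTS5.
type Doc = { id: string; tokens: string[] };

function bm25Scores(
  docs: Doc[],
  query: string[],
  k1 = 1.2, // term-frequency saturation (assumed default)
  b = 0.75, // document-length normalization (assumed default)
): Map<string, number> {
  const n = docs.length;
  const avgLen =
    docs.reduce((sum, d) => sum + d.tokens.length, 0) / Math.max(n, 1);
  const scores = new Map<string, number>();

  for (const doc of docs) {
    let score = 0;
    for (const term of query) {
      // Term frequency in this document.
      const tf = doc.tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Number of documents containing the term.
      const df = docs.filter((d) => d.tokens.includes(term)).length;
      // Rare terms get a higher inverse document frequency.
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      const norm = tf + k1 * (1 - b + b * (doc.tokens.length / avgLen));
      score += idf * ((tf * (k1 + 1)) / norm);
    }
    scores.set(doc.id, score);
  }
  return scores;
}
```

Higher scores mean better matches in this sketch. A document that contains none of the query terms scores 0, and repeating a term raises the score with diminishing returns.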

📦 What We Have Done (Phase 1)

Phase 1: Core Engine is fully implemented and tested.

  • ✅ Custom SQLite Database with FTS5 virtual tables and auto-sync triggers.
  • ✅ Web scraping engine supporting standard fetch() and puppeteer-core.
  • ✅ Markdown processor utilizing Readability + Turndown.
  • ✅ Heading-based semantic chunker (500-1200 tokens per chunk).
  • ✅ Asynchronous job manager and queue system.
  • ✅ Complete HTTP API (REST endpoints + SSE event streams).
  • ✅ Seamless integration of 4 MCP tools: manage_library, search_docs, list_libraries, and get_doc_page.
  • ✅ Robust CLI interface (start, add, rename, search, list).
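
The heading-based chunker listed above can be pictured with the following sketch. It only illustrates the general idea of splitting Markdown at headings while keeping the heading trail as context. The `Chunk` type and `chunkByHeadings` function are hypothetical names, and this version does not enforce the 500-1200 token range or skip `#` lines inside fenced code blocks, both of which a real chunker would need to handle.

```typescript
// Illustrative heading-based chunker (not DocShark's actual implementation).
type Chunk = { headingPath: string[]; text: string };

function chunkByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  // Stack of currently open headings, outermost first.
  const stack: { level: number; title: string }[] = [];
  let buffer: string[] = [];

  // Emit the buffered body text under the current heading trail.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text) {
      chunks.push({ headingPath: stack.map((h) => h.title), text });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const level = match[1].length;
      // Close headings at the same or a deeper level before opening this one.
      while (stack.length > 0 && stack[stack.length - 1].level >= level) {
        stack.pop();
      }
      stack.push({ level, title: match[2].trim() });
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each chunk carries its `headingPath` (for example `["Guide", "Install"]`), which is the kind of contextual header that can be prepended to the chunk text before indexing.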

🏗️ What We Are Doing

We are actively polishing the integration between the core engine and external MCP clients (like VS Code Agents and Claude Desktop).

🔮 What We Plan To Do (Phase 2 & Beyond)

  • Web Dashboard: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real-time (via SSE), and test searches manually.
  • Incremental Crawling: Smarter refresh jobs that compare ETag and Last-Modified headers to only re-scrape updated pages.
  • Vector Search (RAG): Integration of lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
  • Advanced Scraping Setup: Support for custom CSS selectors to define exactly where content lives in non-standard documentation websites.
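
Since incremental crawling is still planned, the following is only a sketch of how ETag and Last-Modified comparison is usually done with standard HTTP conditional requests. The `CachedPage` type and both function names are hypothetical. The header names (`If-None-Match`, `If-Modified-Since`) and the `304 Not Modified` status are standard HTTP.

```typescript
// Illustrative conditional-request helpers (hypothetical, not DocShark code).
type CachedPage = { etag?: string; lastModified?: string };

// Build request headers from validators stored during the previous crawl.
function conditionalHeaders(cached: CachedPage): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached.etag) headers["If-None-Match"] = cached.etag;
  if (cached.lastModified) headers["If-Modified-Since"] = cached.lastModified;
  return headers;
}

// A 304 response means the server considers the cached copy still valid.
function needsRescrape(status: number): boolean {
  return status !== 304;
}
```

In use, the headers would be passed to a request such as `fetch(url, { headers: conditionalHeaders(cached) })`, and the page would be re-extracted and re-indexed only when `needsRescrape(response.status)` returns true.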

🛠️ Usage

Quick Start (from npm)

You can run DocShark directly without installing it globally using bunx:

# Add a documentation library to the index
bunx docshark add https://valibot.dev/guides/ --depth 2

# Search your indexed docs
bunx docshark search "schema validation"
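
As a rough picture of what a depth limit means for a breadth-first link crawl, here is a small sketch. It assumes that `--depth` bounds how many link hops are followed from the start URL, which this README does not state explicitly. The `crawlBfs` function and its synchronous `getLinks` callback are hypothetical stand-ins for real page fetching and link extraction.

```typescript
// Illustrative depth-limited BFS over a link graph (not DocShark's crawler).
function crawlBfs(
  start: string,
  getLinks: (url: string) => string[], // hypothetical link extractor
  maxDepth: number,
): string[] {
  const visited = new Set<string>([start]);
  let frontier: string[] = [start];

  // Each iteration follows links one hop further from the start URL.
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return Array.from(visited);
}
```

With `maxDepth` set to 2, the result contains the start page, the pages it links to, and the pages those link to, each visited once.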

Installation

DocShark is intended to be installed and run with Bun. To install it globally as a CLI tool:

# Global Bun installation
bun add -g docshark

After installation, you can use the docshark command:

docshark list

# Update the global Bun installation when a new release is published
docshark update

# Script-friendly update check
docshark update --check --quiet

Interactive CLI runs will also let you know when a newer version is available. Update notices are intentionally skipped for MCP stdio mode so they never interfere with protocol output.

For scripts, docshark update --check exits 0 when current, 10 when a newer version is available, and 1 when the version check could not be completed.
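
A script that wraps the update check could map the documented exit codes to named states, as in this sketch. The `UpdateStatus` type and `interpretUpdateExit` function are hypothetical. Only the codes 0, 10, and 1 come from the description above, and any other code is treated as unknown.

```typescript
// Illustrative mapping of `docshark update --check` exit codes (hypothetical helper).
type UpdateStatus = "current" | "update-available" | "check-failed" | "unknown";

function interpretUpdateExit(code: number): UpdateStatus {
  switch (code) {
    case 0:
      return "current"; // installed version is up to date
    case 10:
      return "update-available"; // a newer version has been published
    case 1:
      return "check-failed"; // the version check could not be completed
    default:
      return "unknown"; // not documented in this README
  }
}
```

The exit code itself can come from whatever process API the script uses, for example the `status` field returned by `spawnSync` in `node:child_process`.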

🧠 Agent Skills

DocShark includes official Agent Skills available on the skills.sh registry. These skills teach AI assistants exactly how to set up, use, and troubleshoot the DocShark MCP server.

To install a skill directly into your AI coding assistant:

# Add the 'docshark' skill for using the MCP tools
npx skills add Michael-Obele/docshark --skill docshark

# Add the 'using-docshark' skill for setup and configuration help
npx skills add Michael-Obele/docshark --skill using-docshark

Skill Setup by Code Editor

The npx skills add CLI automatically configures skills for most editors, but here is how they integrate:

  • Cursor: Skills are added to .cursor/rules/
  • Windsurf: Skills are added to .windsurfrules
  • VS Code (Cline / Roo Code): Skills are added to .clinerules or .roomodes
  • Trae: Skills are added to .trae/skills/
  • GitHub Copilot: Skills are appended to .github/copilot-instructions.md

Check out the skills/README.md for detailed workflows on how these skills optimize your AI coding experience.

🔌 MCP Integration

VS Code (GitHub Copilot / MCP Extension)

Add DocShark to your .vscode/settings.json or global MCP configuration:

{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}

Cursor

  1. Open Cursor Settings > Models > MCP.
  2. Click + Add New MCP Server.
  3. Name: docshark
  4. Type: command
  5. Command: bunx -y docshark start --stdio

Claude Desktop

Edit your Claude Desktop configuration file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add the DocShark server entry:

{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}

🛠️ Development

Local Setup

Ensure you have Bun installed.

# Clone the repository
git clone https://github.com/Michael-Obele/docshark.git
cd docshark

# Install dependencies
bun install

# (Optional) Enable auto-detection and scraping of JavaScript-rendered React/Vue single-page apps
bun add puppeteer-core

# Start the DocShark MCP server in HTTP mode for local testing
bun run src/cli.ts start --port 6380

Local CLI Debugging

# Run CLI directly while developing
bun run src/cli.ts list

Tests

Run the core regression suite before merging or publishing changes:

# From the repo root
pnpm test:core

# Or from packages/core
bun test scripts/*.test.ts

The suite covers the current core engine surfaces: SQLite storage and migrations, library management, extraction, chunking, search, crawl helpers, API routes, and MCP tool wrappers.

🔄 Versioning & Changelog

This project uses Google's Release Please to automate versioning and changelog generation.

  • Semantic Versioning: Our versions automatically bump (e.g. 0.0.1 -> 0.0.2 or 0.1.0) based on standard Conventional Commits (feat:, fix:, chore:, etc.).
  • Automated: When Conventional Commits are merged into master, Release Please automatically opens a release PR that updates CHANGELOG.md.
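
The relationship between commit types and version bumps can be sketched as follows. This is a simplified reading of the Conventional Commits rules (breaking change, then feat, then fix), not Release Please's actual logic, which is configurable and may treat pre-1.0 versions differently. The `Bump` type and `bumpForCommits` function are hypothetical.

```typescript
// Illustrative Conventional Commits bump calculation (simplified, hypothetical).
type Bump = "major" | "minor" | "patch" | "none";

function bumpForCommits(messages: string[]): Bump {
  let bump: Bump = "none";
  for (const msg of messages) {
    const header = msg.split("\n")[0];
    // Matches "type(scope)!: subject"; scope and "!" are optional.
    const match = /^(\w+)(\([^)]*\))?(!)?:\s/.exec(header);
    if (!match) continue; // ignore commits that do not follow the convention
    if (match[3] === "!" || /^BREAKING CHANGE:/m.test(msg)) return "major";
    if (match[1] === "feat") bump = "minor";
    else if (match[1] === "fix" && bump === "none") bump = "patch";
  }
  return bump;
}
```

Under these simplified rules, a batch containing only `fix:` and `chore:` commits yields a patch bump, while any `feat:` commit raises it to a minor bump.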

📜 License

This project is open-source and available under the MIT License.


Built to empower AI agents with the latest knowledge.
