DocShark
DocShark is an MCP server that scrapes and indexes documentation websites, enabling AI assistants to perform full-text searches on a local knowledge base built from public docs.
🦈 DocShark
DocShark is an MCP (Model Context Protocol) server that scrapes, indexes, and searches any documentation website. It builds a local, searchable knowledge base from public documentation pages using SQLite FTS5 full-text search with BM25 ranking, so AI assistants can query up-to-date docs.
🚀 Features
- Automated Crawling: Discovers pages via `sitemap.xml`, with fallback to BFS link crawling.
- Smart Extraction: Uses Readability and Turndown to extract main content and convert it to clean Markdown, filtering out navbars and sidebars.
- Semantic Chunking: Splits content based on headings, preserving contextual headers for better AI understanding.
- High-Performance Search: Built-in SQLite + FTS5 indexing with BM25 ranking for accurate and fast search results.
- JS-Rendered Site Support: Tiered fetching strategy automatically detects React/Vue SPAs (empty shells) and upgrades to `puppeteer-core` if you have it installed (zero-config, auto-fallback).
- Polite Crawling: Respects `robots.txt` and implements rate limiting to prevent overloading documentation servers.
- Standard MCP Tooling: Connects with Claude Desktop, VS Code, Cursor, and any other MCP-compatible client via standard `stdio` or `http/sse` transports.
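The discovery strategy in the first bullet (sitemap first, breadth-first link crawling as a fallback) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `parseSitemap`, `bfsDiscover`, and `discoverPages` are hypothetical, and the link graph is passed in as a plain callback rather than fetched over the network, so it does not reflect DocShark's actual crawler code.

```typescript
// Hypothetical helper names; DocShark's real crawler may be organized differently.

/** Pull every <loc> URL out of a sitemap.xml body. */
function parseSitemap(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const url = match[1];
    if (url !== undefined) urls.push(url);
  }
  return urls;
}

/** Breadth-first traversal that stops maxDepth hops away from the start page. */
function bfsDiscover(
  start: string,
  linksOf: (url: string) => string[],
  maxDepth: number,
): string[] {
  const seen = new Set<string>([start]);
  let frontier: string[] = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of linksOf(url)) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  // A Set iterates in insertion order, so pages come back in discovery order.
  return Array.from(seen);
}

/** Prefer the sitemap; fall back to BFS when it is missing or lists no URLs. */
function discoverPages(
  sitemapXml: string | null,
  start: string,
  linksOf: (url: string) => string[],
  maxDepth: number,
): string[] {
  const fromSitemap = sitemapXml === null ? [] : parseSitemap(sitemapXml);
  return fromSitemap.length > 0
    ? fromSitemap
    : bfsDiscover(start, linksOf, maxDepth);
}
```

In a real crawler `linksOf` would fetch a page and extract its same-site links, and each fetch would go through the `robots.txt` check and rate limiter mentioned above.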
📦 What We Have Done (Phase 1)
Phase 1: Core Engine is fully implemented and tested.
- ✅ Custom SQLite Database with FTS5 virtual tables and auto-sync triggers.
- ✅ Web scraping engine supporting standard `fetch()` and `puppeteer-core`.
- ✅ Markdown processor utilizing Readability + Turndown.
- ✅ Heading-based semantic chunker (500-1200 tokens per chunk).
- ✅ Asynchronous job manager and queue system.
- ✅ Complete HTTP API (REST endpoints + SSE event streams).
- ✅ Seamless integration of 4 MCP tools: `manage_library`, `search_docs`, `list_libraries`, and `get_doc_page`.
- ✅ Robust CLI interface (`start`, `add`, `rename`, `search`, `list`).
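To make the heading-based chunker more concrete, here is a minimal sketch of the general technique. It is not DocShark's implementation: the `Chunk` shape, the function names, and the word-count token estimate are all assumptions made for illustration, and the 500-token lower bound (merging small sections) is left out. It also does not special-case `#` lines inside fenced code blocks.

```typescript
// Hypothetical types and names, for illustration only.
interface Chunk {
  headingPath: string[]; // headings above this text, e.g. ["Guide", "Install"]
  text: string;
}

// Rough token estimate: whitespace-separated words. A real tokenizer would differ.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function chunkMarkdown(markdown: string, maxTokens: number = 1200): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = [];
  let buffer: string[] = [];

  // Emit the buffered lines as one chunk tagged with the current heading path.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text !== "") chunks.push({ headingPath: path.slice(), text });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading !== null) {
      flush();
      const level = (heading[1] || "").length;
      // Drop headings at the same or deeper level, then record the new one.
      path.length = Math.min(path.length, level - 1);
      path.push((heading[2] || "").trim());
      continue;
    }
    buffer.push(line);
    // Close a chunk at the first paragraph break once it reaches the budget,
    // so a chunk can overshoot maxTokens by up to one paragraph.
    if (line.trim() === "" && estimateTokens(buffer.join("\n")) >= maxTokens) {
      flush();
    }
  }
  flush();
  return chunks;
}
```

Keeping the heading path on every chunk is what lets a search hit deep inside a page still carry its context (for example "Guide > Install") back to the assistant.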
🏗️ What We Are Doing
We are actively polishing the integration between the core engine and external MCP clients (like VS Code Agents and Claude Desktop).
🔮 What We Plan To Do (Phase 2 & Beyond)
- Web Dashboard: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real-time (via SSE), and test searches manually.
- Incremental Crawling: Smarter `refresh` jobs that compare `ETag` and `Last-Modified` headers to only re-scrape updated pages.
- Vector Search (RAG): Integration of lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
- Advanced Scraping Setup: Support for custom CSS selectors to define exactly where content lives in non-standard documentation websites.
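As a rough illustration of how the planned incremental crawling could work, standard HTTP conditional requests send the stored `ETag` and `Last-Modified` values back as `If-None-Match` and `If-Modified-Since`, and a `304 Not Modified` response means the page can be skipped. The sketch below shows that general mechanism only. The `CachedPage` shape, the function names, and the injected `FetchLike` type are hypothetical, and since this feature is still on the roadmap, nothing here describes DocShark's eventual design.

```typescript
// Hypothetical shape of what a crawler might store per page.
interface CachedPage {
  etag?: string;
  lastModified?: string;
}

// Minimal fetch-compatible signature so the sketch does not depend on global types.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ status: number }>;

/** Build conditional request headers from previously stored validators. */
function conditionalHeaders(cached: CachedPage): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached.etag) headers["If-None-Match"] = cached.etag;
  if (cached.lastModified) headers["If-Modified-Since"] = cached.lastModified;
  return headers;
}

/** 304 Not Modified means the stored copy is still valid; anything else is re-processed. */
function needsRescrape(status: number): boolean {
  return status !== 304;
}

/** Ask the server whether the page changed since it was last scraped. */
async function pageChanged(
  url: string,
  cached: CachedPage,
  fetchLike: FetchLike,
): Promise<boolean> {
  const response = await fetchLike(url, { headers: conditionalHeaders(cached) });
  return needsRescrape(response.status);
}
```

In Bun or a recent Node.js, the global `fetch` can be passed as `fetchLike`. Servers that send neither validator always answer with a full response, so such pages would be re-scraped every time.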
🛠️ Usage
Quick Start (from npm)
You can run DocShark directly without installing it globally using bunx:
# Add a documentation library to the index
bunx docshark add https://valibot.dev/guides/ --depth 2
# Search your indexed docs
bunx docshark search "schema validation"
Installation
DocShark is intended to be installed and run with Bun. To install it globally as a CLI tool:
# Global Bun installation
bun add -g docshark
After installation, you can use the docshark command:
docshark list
# Update the global Bun installation when a new release is published
docshark update
# Script-friendly update check
docshark update --check --quiet
Interactive CLI runs will also let you know when a newer version is available. Update notices are intentionally skipped for MCP stdio mode so they never interfere with protocol output.
For scripts, docshark update --check exits 0 when current, 10 when a newer version is available, and 1 when the version check could not be completed.
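A script that wraps the update check might translate those exit codes along these lines. This is a small sketch, not part of DocShark: the `UpdateStatus` type and `interpretUpdateCheck` function are hypothetical names, and treating any code other than 0 or 10 as a failed check is an assumption, since only 0, 10, and 1 are documented above.

```typescript
// Hypothetical names; only the exit codes 0, 10, and 1 come from the text above.
type UpdateStatus = "current" | "update-available" | "check-failed";

function interpretUpdateCheck(exitCode: number | null): UpdateStatus {
  switch (exitCode) {
    case 0:
      return "current";
    case 10:
      return "update-available";
    default:
      // Covers the documented code 1 and, by assumption, any other value
      // (including null, which child_process reports when a signal killed the process).
      return "check-failed";
  }
}

// Example wiring (not executed here), using the node:child_process API,
// which Bun also provides:
//
//   import { spawnSync } from "node:child_process";
//   const result = spawnSync("docshark", ["update", "--check", "--quiet"]);
//   console.log(interpretUpdateCheck(result.status));
```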
🧠 Agent Skills
DocShark includes official Agent Skills available on the skills.sh registry. These skills teach AI assistants exactly how to set up, use, and troubleshoot the DocShark MCP server.
To install a skill directly into your AI coding assistant:
# Add the 'docshark' skill for using the MCP tools
npx skills add Michael-Obele/docshark --skill docshark
# Add the 'using-docshark' skill for setup and configuration help
npx skills add Michael-Obele/docshark --skill using-docshark
Skill Setup by Code Editor
The npx skills add CLI automatically configures skills for most editors, but here is how they integrate:
- Cursor: Skills are added to `.cursor/rules/`
- Windsurf: Skills are added to `.windsurfrules`
- VS Code (Cline / Roo Code): Skills are added to `.clinerules` or `.roomodes`
- Trae: Skills are added to `.trae/skills/`
- GitHub Copilot: Skills are appended to `.github/copilot-instructions.md`
Check out the skills/README.md for detailed workflows on how these skills optimize your AI coding experience.
🔌 MCP Integration
VS Code (GitHub Copilot / MCP Extension)
Add DocShark to your .vscode/settings.json or global MCP configuration:
{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}
Cursor
- Open Cursor Settings > Models > MCP.
- Click + Add New MCP Server.
- Name: `docshark`
- Type: `command`
- Command: `bunx -y docshark start --stdio`
Claude Desktop
Edit your Claude Desktop configuration file:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}
🛠️ Development
Local Setup
Ensure you have Bun installed.
# Clone the repository
git clone https://github.com/Michael-Obele/docshark.git
cd docshark
# Install dependencies
bun install
# (Optional) Enable auto-detection and scraping of JavaScript-rendered React/Vue single-page apps
bun add puppeteer-core
# Start the DocShark MCP server in HTTP mode for local testing
bun run src/cli.ts start --port 6380
Local CLI Debugging
# Run CLI directly while developing
bun run src/cli.ts list
Tests
Run the core regression suite before merging or publishing changes:
# From the repo root
pnpm test:core
# Or from packages/core
bun test scripts/*.test.ts
The suite covers the current core engine surfaces: SQLite storage and migrations, library management, extraction, chunking, search, crawl helpers, API routes, and MCP tool wrappers.
🔄 Versioning & Changelog
This project uses Google's Release Please to automate versioning and changelog generation.
- Semantic Versioning: Versions are bumped automatically (e.g. `0.0.1` -> `0.0.2` or `0.1.0`) based on standard Conventional Commits (`feat:`, `fix:`, `chore:`, etc.).
- Automated: A PR is automatically created on `master` when standard commits are merged, generating a standard `CHANGELOG.md`.
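For readers unfamiliar with how Conventional Commits drive a version bump, the sketch below shows the plain Semantic Versioning mapping: `fix:` raises the patch number, `feat:` the minor number, and a `!` marker the major number. It is an illustration only. The names `bumpFor` and `nextVersion` are hypothetical, `BREAKING CHANGE:` footers are ignored because only commit subjects are inspected, and Release Please itself may bump pre-1.0 versions differently depending on its configuration.

```typescript
// Hypothetical helpers illustrating the SemVer mapping, not Release Please's code.
type Bump = "major" | "minor" | "patch" | "none";

/** Pick the largest bump implied by a list of commit subject lines. */
function bumpFor(commitSubjects: string[]): Bump {
  let bump: Bump = "none";
  for (const subject of commitSubjects) {
    // type, optional (scope), optional "!" breaking marker, then ":"
    const match = /^(\w+)(\([^)]*\))?(!)?:/.exec(subject);
    if (match === null) continue;
    if (match[3] === "!") return "major";
    if (match[1] === "feat") bump = "minor";
    else if (match[1] === "fix" && bump === "none") bump = "patch";
  }
  return bump;
}

/** Apply a bump to a "major.minor.patch" version string. */
function nextVersion(current: string, bump: Bump): string {
  const [major = 0, minor = 0, patch = 0] = current.split(".").map(Number);
  switch (bump) {
    case "major":
      return `${major + 1}.0.0`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    default:
      return current;
  }
}
```

With this mapping, `0.0.1` plus a `fix:` commit gives `0.0.2`, and `0.0.1` plus a `feat:` commit gives `0.1.0`, which matches the example in the list above. Commits such as `chore:` produce no bump on their own.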
📜 License
This project is open-source and available under the MIT License.
Built to empower AI agents with the latest knowledge.