# MarkCrawl by iD8

Turn any website into clean Markdown for LLM pipelines, in one command.
```shell
pip install markcrawl
markcrawl --base https://docs.example.com --out ./output --show-progress
```
MarkCrawl is a crawl-and-structure engine. It crawls a website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. No API keys needed.
Everything else (LLM extraction, Supabase upload, MCP server, LangChain tools) is optional and installed separately.
## Quickstart (2 minutes)
```shell
pip install markcrawl
markcrawl --base https://httpbin.org --out ./demo --show-progress
```
Your `./demo` folder now contains:

```
demo/
├── index__a4f3b2c1d0.md   ← clean Markdown of the page
└── pages.jsonl            ← structured index (one JSON line per page)
```
Each line in `pages.jsonl`:

```json
{
  "url": "https://httpbin.org/",
  "title": "httpbin.org",
  "crawled_at": "2026-04-04T12:30:00Z",
  "citation": "httpbin.org. httpbin.org. Available at: https://httpbin.org/ [Accessed April 04, 2026].",
  "tool": "markcrawl",
  "text": "# httpbin.org\n\nA simple HTTP Request & Response Service..."
}
```
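Since `pages.jsonl` is plain JSON Lines, no special tooling is needed to consume it. A minimal sketch using only the standard library; the sample record mirrors the example above, trimmed to a few fields, and the demo file path is illustrative:

```python
import json
import tempfile
from pathlib import Path

def load_pages(path):
    """Yield one page dict per JSON line of a pages.jsonl index."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():              # tolerate blank lines
                yield json.loads(line)

# Demo: write a one-line index shaped like the record above, then read it back.
sample = {"url": "https://httpbin.org/", "title": "httpbin.org", "tool": "markcrawl"}
demo = Path(tempfile.gettempdir()) / "pages_demo.jsonl"
demo.write_text(json.dumps(sample) + "\n", encoding="utf-8")

pages = list(load_pages(demo))
print(pages[0]["url"], "-", pages[0]["title"])
```

Because each line is an independent JSON object, the index streams well: you can process crawls larger than memory one page at a time.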
<details>
<summary>How it compares to other crawlers</summary>

Different tools make different tradeoffs. This table summarizes the main differences:

| | MarkCrawl | FireCrawl | Crawl4AI | Scrapy |
|---|---|---|---|---|
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 |
| Install | `pip install markcrawl` | SaaS or self-host | pip + Playwright | pip + framework |
| Output | Markdown + JSONL | Markdown + JSON | Markdown | Custom pipelines |
| JS rendering | Optional (`--render-js`) | Built-in | Built-in | Plugin |
| LLM extraction | Optional add-on | Via API | Built-in | None |
| Best for | Single-site crawl → Markdown | Hosted scraping API | AI-native crawling | Large-scale distributed |

Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep browser automation, and Scrapy handles massive distributed workloads. MarkCrawl focuses on simple local crawls that produce LLM-ready Markdown.

See `benchmarks/SPEED_COMPARISON.md` for head-to-head performance data (3 tools, 4 sites, 3 iterations each).

</details>
## Installation

The core crawler is the only thing you need. Everything else is optional.
```shell
pip install markcrawl              # core crawler (free, no API keys)
```

Optional add-ons:

```shell
pip install markcrawl[extract]     # + LLM extraction (OpenAI, Claude, Gemini, Grok)
pip install markcrawl[js]          # + JavaScript rendering (Playwright)
pip install markcrawl[upload]      # + Supabase upload with embeddings
pip install markcrawl[mcp]         # + MCP server for AI agents
pip install markcrawl[langchain]   # + LangChain tool wrappers
pip install markcrawl[all]         # everything
```

For Playwright, also run `playwright install chromium` after installing.
<details>
<summary>Install from source (for development)</summary>

```shell
git clone https://github.com/AIMLPM/markcrawl.git
cd markcrawl
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"
```

</details>
## Crawling

```shell
markcrawl --base https://www.example.com --out ./output --show-progress
```
Add flags as needed:

```shell
markcrawl \
  --base https://www.example.com \
  --out ./output \
  --include-subdomains \
  --render-js \
  --concurrency 5 \
  --proxy http://proxy:8080 \
  --max-pages 200 \
  --format markdown \
  --show-progress
```

- `--include-subdomains` - crawl sub.example.com too
- `--render-js` - render JavaScript (React, Vue, etc.)
- `--concurrency 5` - fetch 5 pages in parallel
- `--proxy http://proxy:8080` - route through a proxy
- `--max-pages 200` - stop after 200 pages
- `--format markdown` - or `text` for plain text
Resume an interrupted crawl:

```shell
markcrawl --base https://www.example.com --out ./output --resume --show-progress
```
## Output

Each page becomes a `.md` file with a citation header:

```markdown
# Getting Started

> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026
> Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].

Welcome to the platform. This guide walks you through installation...
```
Navigation, footer, cookie banners, and scripts are stripped. Only the main content remains.
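The citation format shown above is regular enough to reproduce or validate yourself. A sketch of the pattern using only the title, URL, and access date; this helper is illustrative and not part of the markcrawl API:

```python
from datetime import date
from urllib.parse import urlparse

def make_citation(title: str, url: str, accessed: date) -> str:
    """Build a citation string in the same shape MarkCrawl writes."""
    domain = urlparse(url).netloc
    when = accessed.strftime("%B %d, %Y")
    return f"{title}. {domain}. Available at: {url} [Accessed {when}]."

print(make_citation("Getting Started",
                    "https://docs.example.com/getting-started",
                    date(2026, 4, 4)))
```

This can be handy for spot-checking `pages.jsonl` records or regenerating citations with a different access date.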
<details>
<summary>All crawler CLI arguments</summary>

| Argument | Description |
|---|---|
| `--base` | Base site URL to crawl |
| `--out` | Output directory |
| `--format` | `markdown` or `text` (default: `markdown`) |
| `--show-progress` | Print progress and crawl events |
| `--render-js` | Render JavaScript with Playwright before extracting |
| `--concurrency` | Pages to fetch in parallel (default: 1) |
| `--proxy` | HTTP/HTTPS proxy URL |
| `--resume` | Resume from saved state |
| `--include-subdomains` | Include subdomains under the base domain |
| `--max-pages` | Max pages to save; 0 = unlimited (default: 500) |
| `--delay` | Minimum delay between requests in seconds (default: 0; adaptive throttle adjusts automatically) |
| `--timeout` | Per-request timeout in seconds (default: 15) |
| `--min-words` | Skip pages with fewer words (default: 20) |
| `--user-agent` | Override the default user agent |
| `--use-sitemap` / `--no-sitemap` | Enable/disable sitemap discovery |

</details>
## Optional: structured extraction
If you need structured data (not just text), the extraction add-on uses an LLM to pull specific fields from each page.
```shell
pip install markcrawl[extract]

markcrawl-extract \
  --jsonl ./output/pages.jsonl \
  --fields company_name pricing features \
  --show-progress
```
Auto-discover fields across multiple crawled sites:

```shell
markcrawl-extract \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
  --auto-fields \
  --context "competitor pricing analysis" \
  --show-progress
```
Supports OpenAI, Anthropic (Claude), Google Gemini, and xAI (Grok) via `--provider`.
<details>
<summary>Extraction details</summary>

**Provider and model selection**

```shell
markcrawl-extract --jsonl ... --fields pricing --provider openai     # default
markcrawl-extract --jsonl ... --fields pricing --provider anthropic  # Claude
markcrawl-extract --jsonl ... --fields pricing --provider gemini     # Gemini
markcrawl-extract --jsonl ... --fields pricing --provider grok       # Grok
markcrawl-extract --jsonl ... --fields pricing --model gpt-4o        # override model
```

| Provider | API key env var | Default model |
|---|---|---|
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |
| Google Gemini | `GEMINI_API_KEY` | `gemini-2.0-flash` |
| xAI (Grok) | `XAI_API_KEY` | `grok-3-mini-fast` |
**All extraction CLI arguments**

| Argument | Description |
|---|---|
| `--jsonl` | Path(s) to `pages.jsonl`; pass multiple for cross-site analysis |
| `--fields` | Field names to extract (space-separated) |
| `--auto-fields` | Auto-discover fields by sampling pages |
| `--context` | Describe your goal for auto-discovery |
| `--sample-size` | Pages to sample for auto-discovery (default: 3) |
| `--provider` | `openai`, `anthropic`, `gemini`, or `grok` |
| `--model` | Override the default model |
| `--output` | Output path (default: `extracted.jsonl`) |
| `--delay` | Delay between LLM calls in seconds (default: 0.25) |
| `--show-progress` | Print progress |
**Output format**

Extracted rows include LLM attribution:

```json
{
  "url": "https://competitor.com/pricing",
  "citation": "Pricing. competitor.com. Available at: ... [Accessed April 04, 2026].",
  "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
  "extracted_by": "gpt-4o-mini (openai)",
  "extraction_note": "Field values were extracted by an LLM and may be interpreted, not verbatim."
}
```
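Rows in the extraction output are also JSON Lines, so they post-process like any other index. A sketch that collects one field across rows and skips pages where the LLM found nothing; the two inlined records are illustrative stand-ins for lines read from `extracted.jsonl`:

```python
import json

# Two sample rows shaped like the record above (inlined for illustration;
# in practice you would read them from the extraction output file).
raw = """\
{"url": "https://competitor.com/pricing", "pricing_tiers": "Starter ($29/mo)", "extracted_by": "gpt-4o-mini (openai)"}
{"url": "https://other.example/pricing", "pricing_tiers": null, "extracted_by": "gpt-4o-mini (openai)"}
"""

rows = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Keep only rows where the LLM actually produced a value for the field.
found = {r["url"]: r["pricing_tiers"] for r in rows if r.get("pricing_tiers")}
print(found)
```

Filtering on the field value matters because extraction is best-effort: pages without the requested data yield null fields rather than being dropped.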
</details>
## Optional: Supabase vector search (RAG)

Chunk pages, generate embeddings, and upload to Supabase with pgvector:

```shell
pip install markcrawl[upload]

markcrawl --base https://docs.example.com --out ./output --show-progress
markcrawl-upload --jsonl ./output/pages.jsonl --show-progress
```
Requires `SUPABASE_URL`, `SUPABASE_KEY`, and `OPENAI_API_KEY`. See `docs/SUPABASE.md` for table setup, query examples, and recommendations.
## Optional: agent integrations

MarkCrawl includes integrations for AI agents. Each is an optional add-on.
<details>
<summary>MCP Server (Claude Desktop, Cursor, Windsurf)</summary>

```shell
pip install markcrawl[mcp]
```

```json
{
  "mcpServers": {
    "markcrawl": {
      "command": "python",
      "args": ["-m", "markcrawl.mcp_server"]
    }
  }
}
```

Tools: `crawl_site`, `list_pages`, `read_page`, `search_pages`, `extract_data`

</details>
<details>
<summary>LangChain Tool</summary>

```shell
pip install markcrawl[langchain]
```

```python
from markcrawl.langchain import all_tools
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

agent = initialize_agent(
    tools=all_tools,
    llm=ChatOpenAI(model="gpt-4o-mini"),
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
)
agent.run("Crawl docs.example.com and summarize their auth guide")
```

</details>
<details>
<summary>OpenClaw Skill (WhatsApp, Telegram, Slack)</summary>

```shell
npx clawhub install markcrawl-skill
```

See AIMLPM/markcrawl-clawhub-skill.

</details>
<details>
<summary>LLM assistant prompt</summary>

Copy the system prompt from `docs/LLM_PROMPT.md` into any LLM to get an assistant that generates correct MarkCrawl commands.

</details>
## When NOT to use MarkCrawl

- Sites behind login/auth - no cookie or session support
- Aggressive bot protection (Cloudflare, Akamai) - no anti-bot evasion
- Millions of pages - designed for hundreds to low thousands; use Scrapy for scale
- PDF content - HTML only (PDF support is on the roadmap)
- JavaScript SPAs without `--render-js` - add `markcrawl[js]` for React/Vue/Angular
## Architecture

MarkCrawl is a web crawler. The optional layers (extraction, upload, agents) are separate add-ons that work with the crawler's output.

```
CORE (free, no API keys)           OPTIONAL ADD-ONS

┌───────────────────────────┐
│ 1. Discover URLs          │      markcrawl[extract]   → LLM field extraction
│    (sitemap or links)     │      markcrawl[upload]    → Supabase/pgvector RAG
│ 2. Fetch & clean HTML     │      markcrawl[js]        → Playwright JS rendering
│ 3. Write Markdown + JSONL │      markcrawl[mcp]       → MCP server for agents
│    + auto-citation        │      markcrawl[langchain] → LangChain tools
└───────────────────────────┘
```

For internals, see `docs/ARCHITECTURE.md`.
## Extending MarkCrawl

```python
from markcrawl import crawl

result = crawl("https://example.com", out_dir="./output")
print(f"Saved {result.pages_saved} pages")
```

Process the output in your own pipeline:

```python
import json

with open(result.index_file) as f:
    for line in f:
        page = json.loads(line)
        your_db.insert(page)  # Pinecone, Weaviate, Elasticsearch, etc.
```

Use individual components:

```python
from markcrawl import chunk_text
from markcrawl.extract import LLMClient, extract_fields
```

See `docs/ARCHITECTURE.md` for the full module map and extensibility guide.
## Cost

The core crawler is free. Two optional features have API costs:

| Feature | Cost | When |
|---|---|---|
| Structured extraction | ~$0.01-0.03 per page | `markcrawl-extract` |
| Supabase upload | ~$0.0001 per page | `markcrawl-upload` |
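For budgeting, the per-page rates above multiply straightforwardly. A back-of-envelope sketch; the 200-page crawl size and the $0.02 extraction midpoint are assumptions, not measured figures:

```python
pages = 200                   # hypothetical crawl size
extract_usd = pages * 0.02    # midpoint of the ~$0.01-0.03/page estimate
upload_usd = pages * 0.0001   # per-page embedding estimate

print(f"extraction ~ ${extract_usd:.2f}, upload ~ ${upload_usd:.2f}")
```

The takeaway is the roughly two-orders-of-magnitude gap: upload costs are usually negligible next to extraction.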
## Setting up API keys

Only needed for extraction and upload. The core crawler requires no keys.

```shell
# .env - in your working directory
OPENAI_API_KEY="sk-..."          # extraction (--provider openai) + upload
ANTHROPIC_API_KEY="sk-ant-..."   # extraction (--provider anthropic)
GEMINI_API_KEY="AI..."           # extraction (--provider gemini)
XAI_API_KEY="xai-..."            # extraction (--provider grok)
SUPABASE_URL="https://..."       # upload
SUPABASE_KEY="eyJ..."            # upload (service-role key)
```

```shell
set -a; source .env; set +a      # export the variables to child processes
```
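Before running `markcrawl-extract` or `markcrawl-upload`, it can help to fail fast on missing keys. A minimal pre-flight check using only `os.environ`; the variable lists mirror the file above, and the demo values are simulated:

```python
import os

REQUIRED = {
    "extract (openai)": ["OPENAI_API_KEY"],
    "upload": ["SUPABASE_URL", "SUPABASE_KEY", "OPENAI_API_KEY"],
}

def missing_keys(task: str) -> list[str]:
    """Return the env vars a task still needs."""
    return [k for k in REQUIRED[task] if not os.environ.get(k)]

# Demo: simulate having only the OpenAI key configured.
os.environ["OPENAI_API_KEY"] = "sk-test"
os.environ.pop("SUPABASE_URL", None)
os.environ.pop("SUPABASE_KEY", None)

print(missing_keys("extract (openai)"))   # []
print(missing_keys("upload"))             # ['SUPABASE_URL', 'SUPABASE_KEY']
```

Running a check like this up front gives one clear error instead of a mid-run failure after pages have already been processed.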
<details>
<summary>Project structure</summary>

```
.
├── README.md
├── LICENSE
├── PRIVACY.md
├── SECURITY.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── glama.json
├── pyproject.toml
├── requirements.txt
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       ├── ci.yml
│       └── publish.yml
├── docs/
│   ├── ARCHITECTURE.md
│   ├── LLM_PROMPT.md
│   ├── MCP_SUBMISSION.md
│   └── SUPABASE.md
├── tests/
│   ├── test_core.py
│   ├── test_chunker.py
│   ├── test_extract.py
│   └── test_upload.py
└── markcrawl/
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── chunker.py
    ├── exceptions.py
    ├── utils.py
    ├── extract.py
    ├── extract_cli.py
    ├── upload.py
    ├── upload_cli.py
    ├── langchain.py
    └── mcp_server.py
```

</details>
## Roadmap
- [ ] Canonical URL support
- [ ] Fuzzy duplicate-content detection
- [ ] PDF support
- [ ] Authenticated crawling
- [ ] Multi-provider embeddings
<details>
<summary>Shipped features</summary>

- `pip install markcrawl` on PyPI
- 102 automated tests + GitHub Actions CI (Python 3.10-3.13) + ruff linting
- Markdown and plain text output with auto-citation
- Sitemap-first crawling with robots.txt compliance
- Text chunking with configurable overlap
- Supabase/pgvector upload for RAG
- JavaScript rendering via Playwright
- Concurrent fetching and proxy support
- Resume interrupted crawls
- LLM extraction (OpenAI, Claude, Gemini, Grok) with auto-field discovery
- MCP server, LangChain tools, OpenClaw skill

</details>
## Contributing

See CONTRIBUTING.md. If you used an LLM to generate code, include the prompt in your PR.

## Security

See SECURITY.md.

## Privacy

MarkCrawl runs locally. No telemetry, no analytics, no data sent anywhere. See PRIVACY.md.

## License

MIT. See LICENSE.