solocrawl
Enables web search, scraping, and live package version lookup for local LLMs, with no API keys required.
<p align="center"> <img src="assets/icon/solocrawl.png" alt="SoloCrawl logo" width="128"> </p>
<h1 align="center">SoloCrawl</h1>
<p align="center"> <strong>Web search, scraping, and live package-version lookup<br> for your terminal, your code, and your local LLM.</strong><br> No accounts. No API keys. No cloud. </p>
<p align="center"> <img src="https://img.shields.io/badge/python-3.14%2B-blue.svg" alt="Python 3.14+"> <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="MIT License"> <img src="https://img.shields.io/badge/MCP-ready-8A2BE2.svg" alt="MCP ready"> <img src="https://img.shields.io/badge/API%20keys-none%20required-success.svg" alt="No API keys required"> </p>
SoloCrawl is a small, self-hosted, fully async Python tool that does three things well: it searches the web across many sources at once, scrapes pages into clean markdown that's ready for an LLM, and looks up the latest package version from official registries. It runs on your machine, for free, and works the moment you install it: nothing to sign up for, no keys to paste.
It's built for individual developers and people tinkering with local LLMs. Use it as an MCP server (LM Studio, Claude Desktop, OpenCode, …), straight from the CLI, or as a Python library.
Highlights
| Feature | What you get |
|---|---|
| Zero config | Search, scrape, and package lookup all work out of the box: no accounts, no keys. |
| Federated search | Queries up to 11 sources, merges them with Reciprocal Rank Fusion, and de-duplicates URLs into one clean ranking. |
| Smart scraping | HTML → tidy markdown via trafilatura/readability, with a Playwright browser fallback for JS-heavy pages, used only when it's actually needed. |
| Live package versions | 10 ecosystems (PyPI, npm, crates.io, Maven, Go, …) resolved live from official registries, never a stale local DB. |
| MCP-native | Drops straight into local LLM tooling as a stdio MCP server with five ready tools. |
| Fully async, bounded | One shared HTTP client, one recycled browser, global + per-domain concurrency limits. Fast without hammering anyone. |
| Hackable | Add a search or package provider as a single self-registering file; the core stays untouched. |
| Safe by default | Blocks localhost/private/cloud-metadata targets and honours robots.txt. |
Quick start
The recommended way to install SoloCrawl is pipx: it drops the solocrawl and solocrawl-mcp
commands onto your PATH in their own isolated environment, so you can run them from anywhere without
juggling a virtualenv:
git clone https://github.com/hlavacm/solocrawl.git
pipx install ./solocrawl # or an absolute path: pipx install /path/to/solocrawl
Already on PyPI? Then it's just pipx install solocrawl, no checkout needed.
That's it. Now run the three core commands from any directory:
# Scrape a page to markdown
solocrawl scrape https://example.com
# Federated web search (Wikipedia + DuckDuckGo + StackExchange by default)
solocrawl search "python asyncio semaphore" --limit 5
# Live package version lookup
solocrawl package requests --ecosystem pypi
Updating
# Installed from PyPI:
pipx upgrade solocrawl
# Installed from a local checkout: pull the latest changes, then reinstall:
cd /path/to/solocrawl && git pull && pipx install --force . # alias: pipx reinstall solocrawl
After upgrading, restart your MCP client (LM Studio, Cursor, Claude Desktop, …) so it picks up the new
solocrawl-mcp binary. Your mcp.json needs no changes as long as it points at solocrawl-mcp on
your PATH.
What you can do
Search the web
One query, many sources, a single merged ranking, with no single provider deciding everything for you.
solocrawl search "python asyncio semaphore" --limit 5
# Pick exactly which sources to hit, and get machine-readable output
solocrawl search "django orm" --sources wikipedia,stackexchange --json
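The merge step named in the highlights, Reciprocal Rank Fusion with URL de-duplication, can be sketched as follows. This is a minimal illustration rather than SoloCrawl's actual implementation: the function names, the URL normalisation rules, and the constant k=60 (a value commonly used for RRF) are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Collapse trivial URL variants so the same page from two sources merges.

    Assumed rules: ignore the scheme, a leading "www.", and a trailing slash.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse per-source rankings: each source adds 1 / (k + rank) to a URL's score."""
    scores: dict[str, float] = defaultdict(float)
    first_seen: dict[str, str] = {}
    for urls in rankings.values():
        counted: set[str] = set()
        for rank, url in enumerate(urls, start=1):
            key = normalize_url(url)
            if key in counted:
                continue  # count each URL at most once per source
            counted.add(key)
            scores[key] += 1.0 / (k + rank)
            first_seen.setdefault(key, url)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(first_seen[key], score) for key, score in ordered]


fused = reciprocal_rank_fusion(
    {
        "wikipedia": ["https://a.example/x", "https://b.example/y"],
        "duckduckgo": ["https://www.a.example/x/", "https://c.example/z"],
    }
)
for url, score in fused:
    print(f"{score:.4f}  {url}")
```

A URL that several sources rank highly accumulates score from each of them, which is why the page returned by both sources rises to the top here.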
Scrape a page to clean markdown
Turn any URL into LLM-ready markdown with page metadata (title, author, date, …) as front-matter.
solocrawl scrape https://example.com
# Save to a file, or force the browser for a JS-rendered page
solocrawl scrape https://example.com --out page.md
solocrawl scrape https://example.com --force-browser
Look up package versions
The current version, and the one matching your constraint, straight from the official registry.
solocrawl package react --ecosystem npm --constraint ">=18,<19"
solocrawl package monolog/monolog --ecosystem packagist --json
solocrawl package some-lib --ecosystem pypi --allow-prerelease
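To make the --constraint and --allow-prerelease options concrete, here is a rough stdlib-only sketch of picking the newest version that satisfies a comma-separated constraint. It is not SoloCrawl's resolver: the helper names are hypothetical, only simple dotted versions are handled, and any suffix after the numeric part is treated as a pre-release marker. Real ecosystems each have their own version grammar.

```python
import operator
import re

_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}


def parse_version(text: str) -> tuple[tuple[int, ...], bool]:
    """Return (release numbers, is_prerelease) for e.g. '18.3.1' or '19.0.0-rc.1'."""
    match = re.match(r"^v?(\d+(?:\.\d+)*)(.*)$", text.strip())
    if not match:
        raise ValueError(f"unsupported version: {text!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    return release, bool(match.group(2))


def _pad(release: tuple[int, ...], width: int) -> tuple[int, ...]:
    """Right-pad with zeros so (18,) compares like (18, 0, 0)."""
    return release + (0,) * (width - len(release))


def satisfies(version: str, constraint: str) -> bool:
    """Check every comma-separated clause; pre-release suffixes are ignored here."""
    release, _ = parse_version(version)
    for clause in constraint.split(","):
        match = re.match(r"^\s*(>=|<=|==|!=|>|<)\s*(.+?)\s*$", clause)
        if not match:
            raise ValueError(f"unsupported clause: {clause!r}")
        bound, _ = parse_version(match.group(2))
        width = max(len(release), len(bound))
        if not _OPS[match.group(1)](_pad(release, width), _pad(bound, width)):
            return False
    return True


def latest_matching(versions, constraint=None, allow_prerelease=False):
    """Newest version that passes the pre-release filter and the constraint."""
    best, best_key = None, None
    for version in versions:
        release, is_pre = parse_version(version)
        if is_pre and not allow_prerelease:
            continue
        if constraint and not satisfies(version, constraint):
            continue
        key = (_pad(release, 8), not is_pre)  # a final release outranks its pre-releases
        if best_key is None or key > best_key:
            best, best_key = version, key
    return best


published = ["17.0.2", "18.2.0", "18.3.1", "19.0.0-rc.1", "19.0.0"]
print(latest_matching(published, ">=18,<19"))
print(latest_matching(published))
```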
Research in one shot
The classic LLM workflow: search, scrape the top hits, and get back one aggregated, cited report.
solocrawl research "python asyncio semaphore" --depth 3
Batch-scrape many URLs
Fetch a whole list at once under the same bounded concurrency; --out-dir writes one file per URL.
solocrawl batch https://example.com https://www.python.org --out-dir /tmp/scrape
solocrawl batch --from-file urls.txt --out-dir /tmp/scrape
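The "bounded concurrency" mentioned above combines a global limit with a per-domain limit. One common way to get that behaviour with asyncio is a pair of semaphores, sketched below. The class name and structure are assumptions for illustration, not SoloCrawl's internals; the default values mirror SOLOCRAWL_MAX_CONCURRENCY and SOLOCRAWL_PER_DOMAIN_LIMIT.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class BoundedFetcher:
    """Hypothetical helper: cap total and per-domain in-flight requests."""

    def __init__(self, max_concurrency: int = 10, per_domain_limit: int = 2) -> None:
        self._global = asyncio.Semaphore(max_concurrency)
        self._per_domain = defaultdict(lambda: asyncio.Semaphore(per_domain_limit))

    async def run(self, url: str, fetch):
        domain = urlsplit(url).netloc.lower()
        # Take the per-domain slot first so tasks queued behind a busy domain
        # do not occupy global slots that other domains could use.
        async with self._per_domain[domain]:
            async with self._global:
                return await fetch(url)


async def demo() -> tuple[int, int]:
    fetcher = BoundedFetcher(max_concurrency=3, per_domain_limit=2)
    active = 0
    peak = 0
    domain_active: dict[str, int] = defaultdict(int)
    domain_peak = 0

    async def fake_fetch(url: str) -> str:
        # Stand-in for a real HTTP request; records how many run at once.
        nonlocal active, peak, domain_peak
        domain = urlsplit(url).netloc
        active += 1
        domain_active[domain] += 1
        peak = max(peak, active)
        domain_peak = max(domain_peak, domain_active[domain])
        await asyncio.sleep(0.01)
        active -= 1
        domain_active[domain] -= 1
        return url

    urls = [f"https://a.example/{i}" for i in range(5)]
    urls += [f"https://b.example/{i}" for i in range(5)]
    await asyncio.gather(*(fetcher.run(url, fake_fetch) for url in urls))
    return peak, domain_peak


peak, domain_peak = asyncio.run(demo())
print(f"peak in flight: {peak}, peak per domain: {domain_peak}")
```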
…and see what's available
# List every registered provider (search + package), default vs. opt-in
solocrawl providers
Use it with your local LLM (MCP)
This is where SoloCrawl really shines: give your local model (LM Studio, OpenCode, Claude Desktop, …)
the ability to search, scrape, and check versions. The pipx install from the
Quick start already put solocrawl-mcp on your PATH, so all that's left is pointing
your MCP client at it.
LM Studio / Claude Desktop: ready-to-use config at examples/mcp.json.
Drop it into your client's MCP settings (mcp.json):
{
  "mcpServers": {
    "solocrawl": {
      "command": "solocrawl-mcp",
      "args": [],
      "env": {
        "SOLOCRAWL_LOG_LEVEL": "INFO",
        "SOLOCRAWL_LOG_FILE": "~/.local/state/solocrawl/mcp.log"
      }
    }
  }
}
OpenCode uses a different config format. Copy
examples/opencode.jsonc into ~/.config/opencode/opencode.jsonc
(global) or opencode.jsonc in your project root. OpenCode expects type: "local", command as an
array, and environment instead of env.
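The three differences listed above can be expressed as a small conversion from an mcp.json server entry to the OpenCode shape. This is only a sketch of the field mapping described in this section; the function name is hypothetical, and the enclosing structure of opencode.jsonc is not covered here, so treat examples/opencode.jsonc as the authoritative layout.

```python
def to_opencode_entry(server: dict) -> dict:
    """Map one mcp.json server entry to the OpenCode field names.

    Applies exactly the three rules from the text: type "local",
    command as an array, and environment instead of env.
    """
    entry = {
        "type": "local",
        "command": [server["command"], *server.get("args", [])],
    }
    env = server.get("env")
    if env:
        entry["environment"] = dict(env)
    return entry


mcp_entry = {
    "command": "solocrawl-mcp",
    "args": [],
    "env": {"SOLOCRAWL_LOG_LEVEL": "INFO"},
}
print(to_opencode_entry(mcp_entry))
```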
If your MCP client doesn't inherit your shell PATH, replace "solocrawl-mcp" with the full path
from which solocrawl-mcp (typically ~/.local/bin/solocrawl-mcp after pipx install). Logs go to
stderr (visible in LM Studio Developer Logs) and optionally to SOLOCRAWL_LOG_FILE.
The server exposes five tools:
- web_search(query, limit=5, sources=None): federated search across enabled providers
- scrape(url): fetch and extract markdown (with page metadata) from a URL
- research(query, depth=3): search, scrape the top results, and return an aggregated cited report
- package_version(name, ecosystem, constraint=None, allow_prerelease=False): live registry lookup
- list_providers(provider_type="all"): list registered search/package providers (default vs. opt-in)
To check the active version and command path:
pipx list | grep solocrawl
which solocrawl-mcp
Working from a local clone? A pipx-installed solocrawl-mcp is a snapshot: editing the repo does not update the command on your PATH, so your MCP client keeps running the old code. After changing the source, refresh it with pipx install --force . (or install once with pipx install --editable . so future edits are picked up automatically).
Use it from Python
import asyncio

from solocrawl.config import load_config
from solocrawl.core.search import federated_search, select_providers
from solocrawl.core.search.providers import duckduckgo, stackexchange, wikipedia  # noqa: F401


async def main() -> None:
    providers = select_providers(load_config())
    results = await federated_search(providers, "asyncio python", limit=3)
    for result in results:
        print(result.title, result.url)


asyncio.run(main())
See examples/library_search.py for a runnable example.
Search providers
Default (zero-config, always enabled):
| Provider | Source |
|---|---|
| wikipedia | MediaWiki API |
| duckduckgo | ddgs package |
| stackexchange | Stack Exchange API (Stack Overflow) |
Opt-in (enable with SOLOCRAWL_ENABLE_PROVIDERS):
| Provider | Source |
|---|---|
| wikidata | Wikidata entity search |
| hackernews | Hacker News (Algolia) |
| arxiv | arXiv Atom API |
| pubmed | PubMed/NCBI E-utilities |
| github | GitHub repository search |
| mdn | MDN Web Docs search |
| reddit | Reddit post search (search.json) |
| searxng | Self-hosted SearXNG (set SOLOCRAWL_SEARXNG_URL) |
SOLOCRAWL_ENABLE_PROVIDERS=arxiv,hackernews solocrawl search "transformer attention" --limit 6
SOLOCRAWL_ENABLE_PROVIDERS=github,mdn solocrawl search "fetch api" --limit 6
Package registries
Default ecosystems: PyPI, npm, Packagist, crates.io, NuGet, Maven Central,
RubyGems, Go modules, pub.dev, Swift. Versions are always fetched live from official
registries; SoloCrawl does not maintain its own version database. Swift packages have no central
registry, so versions come from the repository's git tags (owner/repo on GitHub).
solocrawl package serde --ecosystem crates
solocrawl package Newtonsoft.Json --ecosystem nuget
solocrawl package org.junit.jupiter:junit-jupiter --ecosystem maven
solocrawl package github.com/gorilla/mux --ecosystem go
solocrawl package apple/swift-argument-parser --ecosystem swift
Optional extras
# Browser fallback for JS-heavy pages (Playwright)
pip install -e ".[browser]"
playwright install chromium
solocrawl scrape https://example.com --force-browser
# Install everything
pip install -e ".[all]"
Configuration
All defaults work with no configuration. Everything below is optional and uses the SOLOCRAWL_
prefix. For local development, copy .env.dist to .env and uncomment what you need. SoloCrawl
loads .env automatically via python-dotenv, and existing
shell environment variables take precedence.
<details> <summary>Environment variables</summary>
| Variable | Default | Purpose |
|---|---|---|
| SOLOCRAWL_ENABLE_PROVIDERS | (empty) | Comma-separated opt-in provider names |
| SOLOCRAWL_SEARXNG_URL | (empty) | Base URL of a self-hosted SearXNG instance (enables the searxng provider) |
| SOLOCRAWL_RESPECT_ROBOTS | true | Honour robots.txt on scrape (fail-open); set false to skip |
| SOLOCRAWL_CACHE_TTL_SECONDS | 0 | In-memory fetch cache TTL in seconds (0 = disabled) |
| SOLOCRAWL_MAX_CONCURRENCY | 10 | Global fetch concurrency limit |
| SOLOCRAWL_PER_DOMAIN_LIMIT | 2 | Per-domain concurrency limit |
| SOLOCRAWL_TIMEOUT_SECONDS | 30 | Per-request timeout in seconds |
| SOLOCRAWL_MAX_RETRIES | 3 | Retries on network errors / rate limits |
| SOLOCRAWL_MAX_RESPONSE_BYTES | 10485760 | Cap on fetched response body size (10 MiB); larger bodies are truncated |
| SOLOCRAWL_PROXY_ENABLED | false | Enable optional proxy layer |
| SOLOCRAWL_PROXY_MODE | list | Proxy mode: list (rotate a pool) or endpoint (single rotating endpoint) |
| SOLOCRAWL_PROXY_LIST | (empty) | Comma-separated proxy URLs |
| SOLOCRAWL_PROXY_ENDPOINT | (empty) | Single rotating proxy endpoint |
| SOLOCRAWL_PROXY_USERNAME | (empty) | Proxy auth username |
| SOLOCRAWL_PROXY_PASSWORD | (empty) | Proxy auth password |
| SOLOCRAWL_ALLOW_INTERNAL_URLS | false | Allow scraping localhost/private IPs (dev only) |
| SOLOCRAWL_USER_AGENT | (SoloCrawl default) | Override HTTP User-Agent for API requests |
| SOLOCRAWL_BROWSER_ALLOWED | true | Allow Playwright fallback when installed |
| SOLOCRAWL_LOG_LEVEL | WARNING | Log level: DEBUG, INFO, WARNING, ERROR |
| SOLOCRAWL_LOG_FILE | (empty) | Optional log file path (also logs to stderr) |
</details>
Security note on URL fetching
By default SoloCrawl refuses to fetch localhost, link-local, private, reserved, and
cloud-metadata addresses. It checks literal hosts, DNS-resolved A/AAAA records, HTTP redirect
targets, and Playwright's final browser URL. SoloCrawl is still a single-user local tool, not a
hostile-multi-tenant proxy; do not expose it to untrusted network callers.
SOLOCRAWL_ALLOW_INTERNAL_URLS=true disables these internal-target checks entirely (intended for
trusted local development only).
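The address classification behind such checks can be approximated with the standard ipaddress module. The sketch below is an assumption about how the categories map onto stdlib properties, not SoloCrawl's actual code: the function names are hypothetical, and the metadata address shown is the one commonly used by cloud providers. It covers only the classification step, so a real check would apply it to every DNS-resolved address and every redirect target, as described above.

```python
import ipaddress

# Widely used cloud metadata endpoint (also link-local, listed for clarity).
_METADATA_ADDRESSES = {ipaddress.ip_address("169.254.169.254")}


def is_internal_address(address: str) -> bool:
    """True if a literal IP is loopback, private, link-local, reserved, or metadata."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the IPv4 rules.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip in _METADATA_ADDRESSES
        or ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_unspecified
        or ip.is_multicast
    )


def all_addresses_allowed(resolved: list[str]) -> bool:
    """Reject a host if any of its resolved addresses is internal."""
    return not any(is_internal_address(address) for address in resolved)


for candidate in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "::1", "8.8.8.8"]:
    print(candidate, "blocked" if is_internal_address(candidate) else "allowed")
```

Rejecting a host when any resolved address is internal, rather than only the first one, avoids a mixed DNS answer slipping an internal target through.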
Extending it
The whole point of the plugin layout is that adding a source is a single self-registering file; the core never changes. To add a search provider:
- Create src/solocrawl/core/search/providers/myprovider.py implementing SearchProvider.
- Register with @register("myprovider", zero_config=True) or as opt-in.
- Import the module in src/solocrawl/core/search/providers/__init__.py so registration runs.
- Add fixture-based tests in tests/.
The same pattern applies to package providers under src/solocrawl/core/packages/providers/.
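The self-registration pattern described in the steps above can be illustrated with a small standalone registry. Only the decorator call shape, @register("myprovider", zero_config=True), comes from this README; the registry internals, the SearchProvider method signature, and the enabled_providers helper are assumptions made for the sketch.

```python
import asyncio
from typing import Protocol


class SearchProvider(Protocol):
    """Assumed interface; the real SearchProvider may differ."""

    async def search(self, query: str, limit: int) -> list[dict]: ...


_REGISTRY: dict[str, tuple[type, bool]] = {}


def register(name: str, zero_config: bool = False):
    """Class decorator: importing the module is enough to register the provider."""

    def decorator(cls: type) -> type:
        if name in _REGISTRY:
            raise ValueError(f"provider already registered: {name}")
        _REGISTRY[name] = (cls, zero_config)
        return cls

    return decorator


@register("myprovider", zero_config=True)
class MyProvider:
    async def search(self, query: str, limit: int) -> list[dict]:
        # Placeholder result; a real provider would call its upstream API here.
        return [{"title": f"Result for {query}", "url": "https://example.com"}][:limit]


@register("optin_demo")
class OptInDemoProvider:
    async def search(self, query: str, limit: int) -> list[dict]:
        return []


def enabled_providers(opt_in: tuple[str, ...] = ()) -> list[str]:
    """Zero-config providers are always on; the rest must be named explicitly."""
    return sorted(
        name for name, (_, zero_config) in _REGISTRY.items()
        if zero_config or name in opt_in
    )


print(enabled_providers())
print(enabled_providers(("optin_demo",)))
print(asyncio.run(MyProvider().search("asyncio", limit=1)))
```

Because registration happens at import time, the third step in the list (importing the module in the package's __init__.py) is what makes the decorator run.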
Development
Work from a checkout in a virtualenv with an editable install; this also drops the solocrawl and
solocrawl-mcp scripts into .venv/bin/:
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
Then run the quality gate:
ruff check . && ruff format --check .
pyright
pytest
# …or all in one line:
ruff check . && pyright && pytest
Ethics and terms of use
SoloCrawl is built for individual developers and local LLM tooling. It respects the robots.txt and
terms of service of target sites: scrape consults robots.txt and refuses disallowed URLs by
default (fail-open on errors; opt out with SOLOCRAWL_RESPECT_ROBOTS=false). The proxy and scraping
features are not intended to bypass site rules, captchas, or anti-bot systems. Use responsibly and
stay within legitimate access patterns.
License
MIT, see LICENSE.