
<p align="center"> <img src="assets/icon/solocrawl.png" alt="SoloCrawl logo" width="128"> </p>

<h1 align="center">SoloCrawl</h1>

<p align="center"> <strong>Web search, scraping, and live package-version lookup<br> for your terminal, your code, and your local LLM.</strong><br> No accounts. No API keys. No cloud. </p>

<p align="center"> <img src="https://img.shields.io/badge/python-3.14%2B-blue.svg" alt="Python 3.14+"> <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="MIT License"> <img src="https://img.shields.io/badge/MCP-ready-8A2BE2.svg" alt="MCP ready"> <img src="https://img.shields.io/badge/API%20keys-none%20required-success.svg" alt="No API keys required"> </p>


SoloCrawl is a small, self-hosted, fully async Python tool that does three things well: it searches the web across many sources at once, scrapes pages into clean markdown that's ready for an LLM, and looks up the latest package version from official registries. It runs on your machine, for free, and works the moment you install it: nothing to sign up for, no keys to paste.

It's built for individual developers and people tinkering with local LLMs. Use it as an MCP server (LM Studio, Claude Desktop, OpenCode, …), straight from the CLI, or as a Python library.

✨ Highlights

  • 🔌 Zero config: Search, scrape, and package lookup all work out of the box, with no accounts and no keys.
  • 🔎 Federated search: Queries up to 11 sources, merges them with Reciprocal Rank Fusion, and de-duplicates URLs into one clean ranking.
  • 📄 Smart scraping: HTML → tidy markdown via trafilatura/readability, with a Playwright browser fallback for JS-heavy pages, used only when it's actually needed.
  • 📦 Live package versions: 10 ecosystems (PyPI, npm, crates.io, Maven, Go, …) resolved live from official registries, never from a stale local DB.
  • 🤖 MCP-native: Drops straight into local LLM tooling as a stdio MCP server with five ready tools.
  • ⚡ Fully async, bounded: One shared HTTP client, one recycled browser, global + per-domain concurrency limits. Fast without hammering anyone.
  • 🧩 Hackable: Add a search or package provider as a single self-registering file; the core stays untouched.
  • 🔒 Safe by default: Blocks localhost/private/cloud-metadata targets and honours robots.txt.

🚀 Quick start

The recommended way to install SoloCrawl is pipx, which puts the solocrawl and solocrawl-mcp commands on your PATH in their own isolated environment, so you can run them from anywhere without managing a virtualenv:

git clone https://github.com/hlavacm/solocrawl.git
pipx install ./solocrawl     # or an absolute path: pipx install /path/to/solocrawl

If SoloCrawl is already published on PyPI, pipx install solocrawl is enough and no checkout is needed.

That's it. Now run the three core commands from any directory:

# Scrape a page to markdown
solocrawl scrape https://example.com

# Federated web search (Wikipedia + DuckDuckGo + StackExchange by default)
solocrawl search "python asyncio semaphore" --limit 5

# Live package version lookup
solocrawl package requests --ecosystem pypi

Updating

# Installed from PyPI:
pipx upgrade solocrawl

# Installed from a local checkout (pull the latest changes, then reinstall):
cd /path/to/solocrawl && git pull && pipx install --force .   # alias: pipx reinstall solocrawl

After upgrading, restart your MCP client (LM Studio, Cursor, Claude Desktop, …) so it picks up the new solocrawl-mcp binary. Your mcp.json needs no changes as long as it points at solocrawl-mcp on your PATH.

What you can do

🔎 Search the web

One query, many sources, a single merged ranking, with no single provider deciding everything for you.

solocrawl search "python asyncio semaphore" --limit 5

# Pick exactly which sources to hit, and get machine-readable output
solocrawl search "django orm" --sources wikipedia,stackexchange --json
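
The merge step can be pictured with a short sketch of Reciprocal Rank Fusion plus URL de-duplication. This is not SoloCrawl's actual code: the function names, the URL normalisation rules, and the constant k=60 (a common default for RRF) are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Crude de-duplication key: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge per-source rankings; each source adds 1 / (k + rank) to a URL's score."""
    scores: dict[str, float] = defaultdict(float)
    for urls in rankings.values():
        seen: set[str] = set()
        for rank, url in enumerate(urls, start=1):
            key = normalize_url(url)
            if key in seen:  # count each URL at most once per source
                continue
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from two sources
rankings = {
    "wikipedia": ["https://a.example/x", "https://b.example/y"],
    "duckduckgo": ["https://b.example/y/", "https://c.example/z"],
}
merged = reciprocal_rank_fusion(rankings)
for url, score in merged:
    print(f"{score:.4f}  {url}")
```

A URL that appears in several sources accumulates score from each of them, so it can outrank a URL that is first in only one source.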

📄 Scrape a page to clean markdown

Turn any URL into LLM-ready markdown with page metadata (title, author, date, …) as front-matter.

solocrawl scrape https://example.com

# Save to a file, or force the browser for a JS-rendered page
solocrawl scrape https://example.com --out page.md
solocrawl scrape https://example.com --force-browser

📦 Look up package versions

The current version, and the one matching your constraint, straight from the official registry.

solocrawl package react --ecosystem npm --constraint ">=18,<19"
solocrawl package monolog/monolog --ecosystem packagist --json
solocrawl package some-lib --ecosystem pypi --allow-prerelease
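
To make the idea of a version matching a constraint concrete, here is a deliberately simplified sketch. It only understands plain dotted releases and comma-separated comparison clauses, and it skips anything that looks like a pre-release. The function names are hypothetical, and SoloCrawl's real resolver presumably follows each ecosystem's own versioning rules.

```python
import operator
import re
from typing import Optional

OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}


def parse_release(text: str) -> tuple[int, ...]:
    """Parse '18.2.0' into (18, 2); trailing zeros are dropped so 18 == 18.0.0."""
    if not re.fullmatch(r"\d+(\.\d+)*", text):
        raise ValueError(f"not a plain release version: {text!r}")
    parts = [int(part) for part in text.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies(version: str, constraint: str) -> bool:
    """True if the version meets every comma-separated clause, e.g. '>=18,<19'."""
    candidate = parse_release(version)
    for clause in constraint.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|!=|>|<)\s*(\S+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not OPERATORS[op](candidate, parse_release(bound)):
            return False
    return True


def best_match(versions: list[str], constraint: str) -> Optional[str]:
    """Highest plain release that satisfies the constraint, or None."""
    matching = []
    for version in versions:
        try:
            if satisfies(version, constraint):
                matching.append(version)
        except ValueError:
            continue  # pre-release or otherwise unparseable: ignored in this sketch
    return max(matching, key=parse_release, default=None)


# Hypothetical list of published versions
published = ["17.0.2", "18.2.0", "18.3.1", "19.0.0", "19.1.0-rc.1"]
print(best_match(published, ">=18,<19"))
```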

🧪 Research in one shot

The classic LLM workflow: search, scrape the top hits, and get back one aggregated, cited report.

solocrawl research "python asyncio semaphore" --depth 3

🗂️ Batch-scrape many URLs

Fetch a whole list at once under the same bounded concurrency; --out-dir writes one file per URL.

solocrawl batch https://example.com https://www.python.org --out-dir /tmp/scrape
solocrawl batch --from-file urls.txt --out-dir /tmp/scrape
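
Bounded concurrency with a global and a per-domain limit can be sketched with nested asyncio semaphores. The class below is a hypothetical illustration, not SoloCrawl's internals. The fetch function is a stand-in that only sleeps, and the limits are set lower than the documented defaults so the effect is visible.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class BoundedRunner:
    """Run fetches under a global limit and a separate limit per hostname."""

    def __init__(self, max_concurrency: int = 10, per_domain_limit: int = 2) -> None:
        self._global = asyncio.Semaphore(max_concurrency)
        self._per_domain = defaultdict(lambda: asyncio.Semaphore(per_domain_limit))

    async def run(self, url, fetch):
        host = urlsplit(url).hostname or ""
        # Take the per-domain slot first so tasks queued behind a busy
        # domain do not occupy global slots while they wait.
        async with self._per_domain[host]:
            async with self._global:
                return await fetch(url)


async def demo() -> tuple[int, int]:
    runner = BoundedRunner(max_concurrency=3, per_domain_limit=2)
    active = 0
    peak = 0
    domain_active = defaultdict(int)
    domain_peak = 0

    async def fake_fetch(url: str) -> str:
        nonlocal active, peak, domain_peak
        host = urlsplit(url).hostname
        active += 1
        domain_active[host] += 1
        peak = max(peak, active)
        domain_peak = max(domain_peak, domain_active[host])
        await asyncio.sleep(0.01)  # stand-in for real network I/O
        active -= 1
        domain_active[host] -= 1
        return url

    urls = [f"https://site{i % 2}.example/page{i}" for i in range(8)]
    await asyncio.gather(*(runner.run(url, fake_fetch) for url in urls))
    return peak, domain_peak


peak, domain_peak = asyncio.run(demo())
print(f"peak overall: {peak}, peak per domain: {domain_peak}")
```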

…and see what's available

# List every registered provider (search + package), default vs. opt-in
solocrawl providers

🤖 Use it with your local LLM (MCP)

This is where SoloCrawl really shines: it gives your local model (LM Studio, OpenCode, Claude Desktop, …) the ability to search, scrape, and check versions. The pipx install from the Quick start already put solocrawl-mcp on your PATH, so all that's left is pointing your MCP client at it.

LM Studio / Claude Desktop: a ready-to-use config is at examples/mcp.json. Drop it into your client's MCP settings (mcp.json):

{
  "mcpServers": {
    "solocrawl": {
      "command": "solocrawl-mcp",
      "args": [],
      "env": {
        "SOLOCRAWL_LOG_LEVEL": "INFO",
        "SOLOCRAWL_LOG_FILE": "~/.local/state/solocrawl/mcp.log"
      }
    }
  }
}

OpenCode uses a different config format. Copy examples/opencode.jsonc into ~/.config/opencode/opencode.jsonc (global) or opencode.jsonc in your project root. OpenCode expects type: "local", command as an array, and environment instead of env.

If your MCP client doesn't inherit your shell PATH, replace "solocrawl-mcp" with the full path printed by which solocrawl-mcp (typically ~/.local/bin/solocrawl-mcp after pipx install). Logs go to stderr (visible in LM Studio Developer Logs) and optionally to SOLOCRAWL_LOG_FILE.

The server exposes five tools:

  • web_search(query, limit=5, sources=None): federated search across enabled providers
  • scrape(url): fetch and extract markdown (with page metadata) from a URL
  • research(query, depth=3): search, scrape the top results, and return an aggregated cited report
  • package_version(name, ecosystem, constraint=None, allow_prerelease=False): live registry lookup
  • list_providers(provider_type="all"): list registered search/package providers (default vs. opt-in)

To check the active version and command path:

pipx list | grep solocrawl
which solocrawl-mcp

Working from a local clone? A pipx-installed solocrawl-mcp is a snapshot: editing the repo does not update the command on your PATH, so your MCP client keeps running the old code. After changing the source, refresh it with pipx install --force . (or install once with pipx install --editable . so future edits are picked up automatically).

🐍 Use it from Python

import asyncio

from solocrawl.config import load_config
from solocrawl.core.search import federated_search, select_providers
# Importing the provider modules runs their registration; the names themselves are unused.
from solocrawl.core.search.providers import duckduckgo, stackexchange, wikipedia  # noqa: F401

async def main() -> None:
    providers = select_providers(load_config())
    results = await federated_search(providers, "asyncio python", limit=3)
    for result in results:
        print(result.title, result.url)

asyncio.run(main())

See examples/library_search.py for a runnable example.

Search providers

Default (zero-config, always enabled):

| Provider | Source |
| --- | --- |
| wikipedia | MediaWiki API |
| duckduckgo | ddgs package |
| stackexchange | Stack Exchange API (Stack Overflow) |

Opt-in (enable with SOLOCRAWL_ENABLE_PROVIDERS):

| Provider | Source |
| --- | --- |
| wikidata | Wikidata entity search |
| hackernews | Hacker News (Algolia) |
| arxiv | arXiv Atom API |
| pubmed | PubMed/NCBI E-utilities |
| github | GitHub repository search |
| mdn | MDN Web Docs search |
| reddit | Reddit post search (search.json) |
| searxng | Self-hosted SearXNG (set SOLOCRAWL_SEARXNG_URL) |

SOLOCRAWL_ENABLE_PROVIDERS=arxiv,hackernews solocrawl search "transformer attention" --limit 6
SOLOCRAWL_ENABLE_PROVIDERS=github,mdn solocrawl search "fetch api" --limit 6

Package registries

Default ecosystems: PyPI, npm, Packagist, crates.io, NuGet, Maven Central, RubyGems, Go modules, pub.dev, Swift. Versions are always fetched live from official registries; SoloCrawl does not maintain its own version database. Swift packages have no central registry, so versions come from the repository's git tags (owner/repo on GitHub).

solocrawl package serde --ecosystem crates
solocrawl package Newtonsoft.Json --ecosystem nuget
solocrawl package org.junit.jupiter:junit-jupiter --ecosystem maven
solocrawl package github.com/gorilla/mux --ecosystem go
solocrawl package apple/swift-argument-parser --ecosystem swift

Optional extras

# Browser fallback for JS-heavy pages (Playwright)
pip install -e ".[browser]"
playwright install chromium

solocrawl scrape https://example.com --force-browser

# Install everything
pip install -e ".[all]"

⚙️ Configuration

All defaults work with no configuration. Everything below is optional and uses the SOLOCRAWL_ prefix. For local development, copy .env.dist to .env and uncomment what you need. SoloCrawl loads .env automatically via python-dotenv, and existing shell environment variables take precedence.

<details> <summary>Environment variables</summary>

| Variable | Default | Purpose |
| --- | --- | --- |
| SOLOCRAWL_ENABLE_PROVIDERS | (empty) | Comma-separated opt-in provider names |
| SOLOCRAWL_SEARXNG_URL | (empty) | Base URL of a self-hosted SearXNG instance (enables the searxng provider) |
| SOLOCRAWL_RESPECT_ROBOTS | true | Honour robots.txt on scrape (fail-open); set false to skip |
| SOLOCRAWL_CACHE_TTL_SECONDS | 0 | In-memory fetch cache TTL in seconds (0 = disabled) |
| SOLOCRAWL_MAX_CONCURRENCY | 10 | Global fetch concurrency limit |
| SOLOCRAWL_PER_DOMAIN_LIMIT | 2 | Per-domain concurrency limit |
| SOLOCRAWL_TIMEOUT_SECONDS | 30 | Per-request timeout in seconds |
| SOLOCRAWL_MAX_RETRIES | 3 | Retries on network errors / rate limits |
| SOLOCRAWL_MAX_RESPONSE_BYTES | 10485760 | Cap on fetched response body size (10 MiB); larger bodies are truncated |
| SOLOCRAWL_PROXY_ENABLED | false | Enable optional proxy layer |
| SOLOCRAWL_PROXY_MODE | list | Proxy mode: list (rotate a pool) or endpoint (single rotating endpoint) |
| SOLOCRAWL_PROXY_LIST | (empty) | Comma-separated proxy URLs |
| SOLOCRAWL_PROXY_ENDPOINT | (empty) | Single rotating proxy endpoint |
| SOLOCRAWL_PROXY_USERNAME | (empty) | Proxy auth username |
| SOLOCRAWL_PROXY_PASSWORD | (empty) | Proxy auth password |
| SOLOCRAWL_ALLOW_INTERNAL_URLS | false | Allow scraping localhost/private IPs (dev only) |
| SOLOCRAWL_USER_AGENT | (SoloCrawl default) | Override HTTP User-Agent for API requests |
| SOLOCRAWL_BROWSER_ALLOWED | true | Allow Playwright fallback when installed |
| SOLOCRAWL_LOG_LEVEL | WARNING | Log level: DEBUG, INFO, WARNING, ERROR |
| SOLOCRAWL_LOG_FILE | (empty) | Optional log file path (also logs to stderr) |

</details>

🔒 Security note on URL fetching

By default SoloCrawl refuses to fetch localhost, link-local, private, reserved, and cloud-metadata addresses. It checks literal hosts, DNS-resolved A/AAAA records, HTTP redirect targets, and Playwright's final browser URL. SoloCrawl is still a single-user local tool, not a hostile-multi-tenant proxy, so do not expose it to untrusted network callers. SOLOCRAWL_ALLOW_INTERNAL_URLS=true disables these internal-target checks entirely (intended for trusted local development only).
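
As a rough illustration of the literal-host part of that check, the sketch below uses only the standard library's ipaddress module. The function names are hypothetical, the http/https-only rule is an added assumption, and the DNS, redirect, and browser-URL checks described above are not reproduced here.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_address(host: str) -> bool:
    """True if host is an IP literal in a loopback, private, link-local, or reserved range."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; a full check would resolve DNS next
    return (
        address.is_loopback
        or address.is_private
        or address.is_link_local  # includes 169.254.169.254, a common cloud-metadata address
        or address.is_reserved
        or address.is_multicast
        or address.is_unspecified
    )


def is_blocked_url(url: str) -> bool:
    """Literal-host check only; hostnames other than localhost pass through here."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):  # assumption: only web schemes are allowed
        return True
    host = parts.hostname
    if host is None:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    return is_blocked_address(host)


for candidate in (
    "http://127.0.0.1:8000/",
    "http://169.254.169.254/latest/meta-data/",
    "http://[::1]/",
    "https://example.com/",
):
    print(candidate, "blocked" if is_blocked_url(candidate) else "allowed")
```

A hostname that passes this literal check can still resolve to an internal address, which is why resolved records and redirect targets need the same test.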

🧩 Extending it

The whole point of the plugin layout is that adding a source is a single self-registering file, so the core never changes. To add a search provider:

  1. Create src/solocrawl/core/search/providers/myprovider.py implementing SearchProvider.
  2. Register with @register("myprovider", zero_config=True) or as opt-in.
  3. Import the module in src/solocrawl/core/search/providers/__init__.py so registration runs.
  4. Add fixture-based tests in tests/.

The same pattern applies to package providers under src/solocrawl/core/packages/providers/.

Development

Work from a checkout in a virtualenv with an editable install, which also puts the solocrawl and solocrawl-mcp scripts into .venv/bin/:

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Then run the quality gate:

ruff check . && ruff format --check .
pyright
pytest

# …or all in one line:
ruff check . && pyright && pytest

Ethics and terms of use

SoloCrawl is built for individual developers and local LLM tooling. It respects the robots.txt and terms of service of target sites: scrape consults robots.txt and refuses disallowed URLs by default (fail-open on errors; opt out with SOLOCRAWL_RESPECT_ROBOTS=false). The proxy and scraping features are not intended to bypass site rules, captchas, or anti-bot systems. Use responsibly and stay within legitimate access patterns.
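
The fail-open robots.txt behaviour can be approximated with the standard library's urllib.robotparser. This is a sketch under assumptions: the function name and the "SoloCrawl" user-agent token are made up for illustration, and passing None stands in for a robots.txt fetch that failed.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: Optional[str], url: str, user_agent: str = "SoloCrawl") -> bool:
    """Return True if the URL may be fetched under the given robots.txt body."""
    if robots_txt is None:
        # Fail open: if robots.txt could not be retrieved, do not block the scrape.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(robots, "https://example.com/private/page"))
print(allowed_by_robots(robots, "https://example.com/public/page"))
print(allowed_by_robots(None, "https://example.com/private/page"))
```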

License

MIT. See LICENSE.
