<h1 align="center">SoloCrawl</h1>

Web search, scraping, and live package-version lookup — for your terminal, your code, and your local LLM. No accounts. No API keys. No cloud.

SoloCrawl is a small, self-hosted, fully async Python tool that does three things well: it searches the web across many sources at once, scrapes pages into clean markdown that's ready for an LLM, and looks up the latest package version from official registries. It runs on your machine, for free, and works the moment you install it — nothing to sign up for, no keys to paste.

It's built for individual developers and people tinkering with local LLMs. Use it as an MCP server (LM Studio, Claude Desktop, OpenCode, …), straight from the CLI, or as a Python library.

✨ Highlights


| Feature | What it does |
| --- | --- |
| 🔌 Zero config | Search, scrape, and package lookup all work out of the box, with no accounts and no keys. |
| 🔎 Federated search | Queries up to 11 sources (3 default, 8 opt-in), merges them with Reciprocal Rank Fusion, and de-duplicates URLs into one clean ranking. |
| 📄 Smart scraping | HTML → tidy markdown via `trafilatura`/`readability`, with a Playwright browser fallback for JS-heavy pages that is used only when it's actually needed. |
| 📦 Live package versions | 10 ecosystems (PyPI, npm, crates.io, Maven, Go, …) resolved live from official registries, never from a stale local DB. |
| 🤖 MCP-native | Drops straight into local LLM tooling as a stdio MCP server with five ready tools. |
| ⚡ Fully async, bounded | One shared HTTP client, one recycled browser, global + per-domain concurrency limits. Fast without hammering anyone. |
| 🧩 Hackable | Add a search or package provider as a single self-registering file; the core stays untouched. |
| 🔒 Safe by default | Blocks localhost/private/cloud-metadata targets and honours `robots.txt`. |

🚀 Quick start

The recommended way to install SoloCrawl is `pipx`: it puts the `solocrawl` and `solocrawl-mcp` commands on your PATH in their own isolated environment, so you can run them from any directory without managing a virtualenv:

git clone https://github.com/hlavacm/solocrawl.git
pipx install ./solocrawl     # or an absolute path: pipx install /path/to/solocrawl

If SoloCrawl is available on PyPI, `pipx install solocrawl` is enough and no checkout is needed.

That's it — now run the three core commands from any directory:

# Scrape a page to markdown
solocrawl scrape https://example.com

# Federated web search (Wikipedia + DuckDuckGo + StackExchange by default)
solocrawl search "python asyncio semaphore" --limit 5

# Live package version lookup
solocrawl package requests --ecosystem pypi

Updating

# Installed from PyPI:
pipx upgrade solocrawl

# Installed from a local checkout — pull the latest changes, then reinstall:
cd /path/to/solocrawl && git pull && pipx install --force .   # alias: pipx reinstall solocrawl

After upgrading, restart your MCP client (LM Studio, Cursor, Claude Desktop, …) so it picks up the new `solocrawl-mcp` binary. Your `mcp.json` needs no changes as long as it points at `solocrawl-mcp` on your PATH.

What you can do

🔎 Search the web

One query, many sources, a single merged ranking — no single provider deciding everything for you.

solocrawl search "python asyncio semaphore" --limit 5

# Pick exactly which sources to hit, and get machine-readable output
solocrawl search "django orm" --sources wikipedia,stackexchange --json
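
To make the "single merged ranking" concrete, here is a minimal sketch of Reciprocal Rank Fusion with URL de-duplication. It is not SoloCrawl's actual code: the function names, the normalisation rules, and the constant `k = 60` (a common choice for RRF) are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Illustrative de-duplication key: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse per-source rankings: a URL scores sum(1 / (k + rank)) over the sources that returned it."""
    scores: dict[str, float] = defaultdict(float)
    for urls in rankings.values():
        seen: set[str] = set()
        rank = 0
        for url in urls:
            key = normalize_url(url)
            if key in seen:  # the same page listed twice by one source counts once
                continue
            seen.add(key)
            rank += 1
            scores[key] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    example = {
        "wikipedia": ["https://a.example/x", "https://b.example/y"],
        "duckduckgo": ["https://b.example/y/", "https://c.example/z"],
    }
    for url, score in reciprocal_rank_fusion(example):
        print(f"{score:.5f}  {url}")
```

A URL that several sources return ends up ahead of one that a single source ranked first, which is the point of fusing by rank rather than by each provider's own score.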

📄 Scrape a page to clean markdown

Turn any URL into LLM-ready markdown with page metadata (title, author, date, …) as front-matter.

solocrawl scrape https://example.com

# Save to a file, or force the browser for a JS-rendered page
solocrawl scrape https://example.com --out page.md
solocrawl scrape https://example.com --force-browser

📦 Look up package versions

The current version — and the one matching your constraint — straight from the official registry.

solocrawl package react --ecosystem npm --constraint ">=18,<19"
solocrawl package monolog/monolog --ecosystem packagist --json
solocrawl package some-lib --ecosystem pypi --allow-prerelease
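
The idea behind "the current version and the one matching your constraint" can be sketched in a few lines. This is a deliberately simplified illustration, not SoloCrawl's resolver: it only understands plain dotted numeric versions and comma-separated comparison clauses, and it treats anything else as a pre-release to skip. Real ecosystems have their own version grammars, and all function names here are hypothetical.

```python
from __future__ import annotations

import re

_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def parse_version(text: str) -> tuple[int, ...] | None:
    """Return a numeric tuple for releases like '18.3.1'; None for anything else."""
    if not re.fullmatch(r"\d+(\.\d+)*", text):
        return None
    return tuple(int(part) for part in text.split("."))


def _pad(a: tuple[int, ...], b: tuple[int, ...]) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Pad the shorter tuple with zeros so that 18 and 18.0.0 compare as equal."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(version: tuple[int, ...], constraint: str) -> bool:
    """Check a version against clauses such as '>=18,<19' (all clauses must hold)."""
    for clause in constraint.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(\d+(?:\.\d+)*)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound_text = match.groups()
        left, right = _pad(version, tuple(int(p) for p in bound_text.split(".")))
        if not _OPS[op](left, right):
            return False
    return True


def pick_versions(published: list[str], constraint: str | None = None) -> tuple[str | None, str | None]:
    """Return (latest stable version, latest stable version matching the constraint)."""
    stable = []
    for text in published:
        parsed = parse_version(text)
        if parsed is not None:  # pre-releases and unparseable tags are skipped
            stable.append((parsed, text))
    stable.sort()
    latest = stable[-1][1] if stable else None
    matching = [text for parsed, text in stable if constraint is None or satisfies(parsed, constraint)]
    return latest, (matching[-1] if matching else None)
```

With the hypothetical release list `["17.0.2", "18.2.0", "18.3.1", "19.0.0", "19.1.0-rc.1"]` and the constraint `">=18,<19"`, the sketch reports `19.0.0` as the latest and `18.3.1` as the best match.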

🧪 Research in one shot

The classic LLM workflow — search, scrape the top hits, and get back one aggregated, cited report.

solocrawl research "python asyncio semaphore" --depth 3
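
The search, scrape, and aggregate flow can be pictured roughly as follows. This is only a sketch of the workflow's shape: the `search` and `scrape` callables are injected stand-ins rather than SoloCrawl functions, and the report layout is an assumption, not the format the tool actually emits.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable


async def build_report(
    query: str,
    search: Callable[[str], Awaitable[list[str]]],
    scrape: Callable[[str], Awaitable[str]],
    depth: int = 3,
) -> str:
    """Search, scrape the top `depth` hits concurrently, and join them into one cited report."""
    urls = (await search(query))[:depth]
    # return_exceptions=True keeps one failing page from sinking the whole report.
    pages = await asyncio.gather(*(scrape(url) for url in urls), return_exceptions=True)

    sections: list[str] = []
    sources: list[str] = []
    for url, page in zip(urls, pages):
        if isinstance(page, BaseException):
            continue  # skip pages that could not be scraped
        number = len(sources) + 1
        sources.append(f"[{number}] {url}")
        sections.append(f"## Source [{number}]\n\n{page.strip()}")

    parts = [f"# Research: {query}", *sections, "## Sources\n\n" + "\n".join(sources)]
    return "\n\n".join(parts)
```

Numbering the sources only after a page has been scraped successfully keeps the citation numbers contiguous even when some hits fail.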

🗂️ Batch-scrape many URLs

Fetch a whole list at once under the same bounded concurrency; `--out-dir` writes one file per URL.

solocrawl batch https://example.com https://www.python.org --out-dir /tmp/scrape
solocrawl batch --from-file urls.txt --out-dir /tmp/scrape
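
"Bounded concurrency" with a global limit plus a per-domain limit can be sketched with two layers of `asyncio.Semaphore`. The class below is a hypothetical illustration, not SoloCrawl's implementation. Its defaults of 10 and 2 mirror the documented defaults of `SOLOCRAWL_MAX_CONCURRENCY` and `SOLOCRAWL_PER_DOMAIN_LIMIT`, and the `fetch` callable is an injected stand-in for a real HTTP request.

```python
from __future__ import annotations

import asyncio
from collections import defaultdict
from collections.abc import Awaitable, Callable
from urllib.parse import urlsplit


class BoundedFetcher:
    """Run fetches under a global cap and a per-domain cap (create it inside a running event loop)."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[str]],
        max_concurrency: int = 10,
        per_domain_limit: int = 2,
    ) -> None:
        self._fetch = fetch
        self._global = asyncio.Semaphore(max_concurrency)
        # One semaphore per host, created lazily the first time that host is seen.
        self._per_domain: dict[str, asyncio.Semaphore] = defaultdict(
            lambda: asyncio.Semaphore(per_domain_limit)
        )

    async def fetch(self, url: str) -> str:
        domain = urlsplit(url).netloc.lower()
        # Take the domain slot first, so a task queued behind a busy domain
        # does not occupy one of the global slots while it waits.
        async with self._per_domain[domain]:
            async with self._global:
                return await self._fetch(url)

    async def fetch_all(self, urls: list[str]) -> list[str | BaseException]:
        """Fetch every URL; failures are returned in place instead of aborting the batch."""
        return await asyncio.gather(*(self.fetch(url) for url in urls), return_exceptions=True)
```

Inside a coroutine, `await BoundedFetcher(my_fetch).fetch_all(urls)` returns results in the same order as `urls`, where `my_fetch` is whatever coroutine performs the actual request.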

…and see what's available

# List every registered provider (search + package), default vs. opt-in
solocrawl providers

🤖 Use it with your local LLM (MCP)

This is where SoloCrawl really shines — give your local model (LM Studio, OpenCode, Claude Desktop, …) the ability to search, scrape, and check versions. The pipx install from the Quick start already put solocrawl-mcp on your PATH, so all that's left is pointing your MCP client at it.

LM Studio / Claude Desktop — ready-to-use config at examples/mcp.json. Drop it into your client's MCP settings (mcp.json):

{
  "mcpServers": {
    "solocrawl": {
      "command": "solocrawl-mcp",
      "args": [],
      "env": {
        "SOLOCRAWL_LOG_LEVEL": "INFO",
        "SOLOCRAWL_LOG_FILE": "~/.local/state/solocrawl/mcp.log"
      }
    }
  }
}

OpenCode uses a different config format. Copy `examples/opencode.jsonc` into `~/.config/opencode/opencode.jsonc` (global) or `opencode.jsonc` in your project root. OpenCode expects `type: "local"`, `command` as an array, and `environment` instead of `env`.

If your MCP client doesn't inherit your shell PATH, replace `"solocrawl-mcp"` with the full path printed by `which solocrawl-mcp` (typically `~/.local/bin/solocrawl-mcp` after `pipx install`). Logs go to stderr (visible in LM Studio Developer Logs) and optionally to the file named in `SOLOCRAWL_LOG_FILE`.
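
If you would rather generate the config than edit it by hand, a small script along these lines can fill in the absolute path automatically. It is a convenience sketch, not part of SoloCrawl: `build_mcp_config` is a hypothetical helper, and it simply reproduces the JSON structure shown above using `shutil.which` from the standard library.

```python
from __future__ import annotations

import json
import shutil


def build_mcp_config(log_level: str = "INFO", log_file: str | None = None) -> dict:
    """Build an mcp.json-style config, preferring the absolute path of solocrawl-mcp."""
    # Fall back to the bare command name if the executable is not on this PATH.
    command = shutil.which("solocrawl-mcp") or "solocrawl-mcp"
    env = {"SOLOCRAWL_LOG_LEVEL": log_level}
    if log_file:
        env["SOLOCRAWL_LOG_FILE"] = log_file
    return {"mcpServers": {"solocrawl": {"command": command, "args": [], "env": env}}}


if __name__ == "__main__":
    config = build_mcp_config(log_file="~/.local/state/solocrawl/mcp.log")
    print(json.dumps(config, indent=2))
```

Run it from the same shell where `which solocrawl-mcp` works, then paste the output into your client's MCP settings.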

The server exposes five tools:

- `web_search(query, limit=5, sources=None)`: federated search across enabled providers
- `scrape(url)`: fetch and extract markdown (with page metadata) from a URL
- `research(query, depth=3)`: search, scrape the top results, and return an aggregated cited report
- `package_version(name, ecosystem, constraint=None, allow_prerelease=False)`: live registry lookup
- `list_providers(provider_type="all")`: list registered search/package providers (default vs. opt-in)

To check the active version and command path:

pipx list | grep solocrawl
which solocrawl-mcp

Working from a local clone? A pipx-installed `solocrawl-mcp` is a snapshot: editing the repo does not update the command on your PATH, so your MCP client keeps running the old code. After changing the source, refresh it with `pipx install --force .` (or install once with `pipx install --editable .` so future edits are picked up automatically).

🐍 Use it from Python

import asyncio

from solocrawl.config import load_config
from solocrawl.core.search import federated_search, select_providers
# Imported for their side effect: importing a provider module registers it.
from solocrawl.core.search.providers import duckduckgo, stackexchange, wikipedia  # noqa: F401


async def main() -> None:
    # Select providers according to the loaded configuration.
    providers = select_providers(load_config())
    # Query the selected providers and get back one merged ranking.
    results = await federated_search(providers, "asyncio python", limit=3)
    for result in results:
        print(result.title, result.url)


asyncio.run(main())

See examples/library_search.py for a runnable example.

Search providers

Default (zero-config, always enabled):

| Provider | Source |
| --- | --- |
| `wikipedia` | MediaWiki API |
| `duckduckgo` | `ddgs` package |
| `stackexchange` | Stack Exchange API (Stack Overflow) |

Opt-in (enable with `SOLOCRAWL_ENABLE_PROVIDERS`):

| Provider | Source |
| --- | --- |
| `wikidata` | Wikidata entity search |
| `hackernews` | Hacker News (Algolia) |
| `arxiv` | arXiv Atom API |
| `pubmed` | PubMed/NCBI E-utilities |
| `github` | GitHub repository search |
| `mdn` | MDN Web Docs search |
| `reddit` | Reddit post search (`search.json`) |
| `searxng` | Self-hosted SearXNG (set `SOLOCRAWL_SEARXNG_URL`) |

SOLOCRAWL_ENABLE_PROVIDERS=arxiv,hackernews solocrawl search "transformer attention" --limit 6
SOLOCRAWL_ENABLE_PROVIDERS=github,mdn solocrawl search "fetch api" --limit 6
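
As a rough mental model of how the opt-in variable combines with the defaults, consider the sketch below. The provider names come from the tables above, but `active_providers` is a hypothetical function, and both the whitespace handling and the choice to reject unknown names are assumptions rather than documented SoloCrawl behaviour.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

DEFAULT_PROVIDERS = ("wikipedia", "duckduckgo", "stackexchange")
OPT_IN_PROVIDERS = (
    "wikidata", "hackernews", "arxiv", "pubmed", "github", "mdn", "reddit", "searxng",
)


def active_providers(environ: Mapping[str, str] | None = None) -> list[str]:
    """Return the default providers followed by any opt-in providers named in the environment."""
    env = os.environ if environ is None else environ
    raw = env.get("SOLOCRAWL_ENABLE_PROVIDERS", "")
    requested = [name.strip().lower() for name in raw.split(",") if name.strip()]

    known = set(DEFAULT_PROVIDERS) | set(OPT_IN_PROVIDERS)
    unknown = [name for name in requested if name not in known]
    if unknown:  # assumption: fail loudly instead of silently ignoring typos
        raise ValueError(f"unknown providers: {', '.join(unknown)}")

    active = list(DEFAULT_PROVIDERS)
    for name in requested:
        if name not in active:
            active.append(name)
    return active
```

Passing a plain dict instead of relying on `os.environ` makes the behaviour easy to try out, for example `active_providers({"SOLOCRAWL_ENABLE_PROVIDERS": "arxiv,hackernews"})`.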

Package registries

Default ecosystems: PyPI, npm, Packagist, crates.io, NuGet, Maven Central, RubyGems, Go modules, pub.dev, Swift. Versions are always fetched live from official registries — SoloCrawl does not maintain its own version database. Swift packages have no central registry, so versions come from the repository's git tags (owner/repo on GitHub).

solocrawl package serde --ecosystem crates
solocrawl package Newtonsoft.Json --ecosystem nuget
solocrawl package org.junit.jupiter:junit-jupiter --ecosystem maven
solocrawl package github.com/gorilla/mux --ecosystem go
solocrawl package apple/swift-argument-parser --ecosystem swift

Optional extras

# Browser fallback for JS-heavy pages (Playwright)
pip install -e ".[browser]"
playwright install chromium

solocrawl scrape https://example.com --force-browser

# Install everything
pip install -e ".[all]"

⚙️ Configuration

All defaults work with no configuration. Everything below is optional and uses the `SOLOCRAWL_` prefix. For local development, copy `.env.dist` to `.env` and uncomment what you need. SoloCrawl loads `.env` automatically via `python-dotenv`, and existing shell environment variables take precedence.

<details> <summary>Environment variables</summary>

| Variable | Default | Purpose |
| --- | --- | --- |
| `SOLOCRAWL_ENABLE_PROVIDERS` | (empty) | Comma-separated opt-in provider names |
| `SOLOCRAWL_SEARXNG_URL` | (empty) | Base URL of a self-hosted SearXNG instance (enables the `searxng` provider) |
| `SOLOCRAWL_RESPECT_ROBOTS` | `true` | Honour `robots.txt` on scrape (fail-open); set `false` to skip |
| `SOLOCRAWL_CACHE_TTL_SECONDS` | `0` | In-memory fetch cache TTL in seconds (`0` = disabled) |
| `SOLOCRAWL_MAX_CONCURRENCY` | `10` | Global fetch concurrency limit |
| `SOLOCRAWL_PER_DOMAIN_LIMIT` | `2` | Per-domain concurrency limit |
| `SOLOCRAWL_TIMEOUT_SECONDS` | `30` | Per-request timeout in seconds |
| `SOLOCRAWL_MAX_RETRIES` | `3` | Retries on network errors / rate limits |
| `SOLOCRAWL_MAX_RESPONSE_BYTES` | `10485760` | Cap on fetched response body size (10 MiB); larger bodies are truncated |
| `SOLOCRAWL_PROXY_ENABLED` | `false` | Enable optional proxy layer |
| `SOLOCRAWL_PROXY_MODE` | `list` | Proxy mode: `list` (rotate a pool) or `endpoint` (single rotating endpoint) |
| `SOLOCRAWL_PROXY_LIST` | (empty) | Comma-separated proxy URLs |
| `SOLOCRAWL_PROXY_ENDPOINT` | (empty) | Single rotating proxy endpoint |
| `SOLOCRAWL_PROXY_USERNAME` | (empty) | Proxy auth username |
| `SOLOCRAWL_PROXY_PASSWORD` | (empty) | Proxy auth password |
| `SOLOCRAWL_ALLOW_INTERNAL_URLS` | `false` | Allow scraping localhost/private IPs (dev only) |
| `SOLOCRAWL_USER_AGENT` | (SoloCrawl default) | Override HTTP User-Agent for API requests |
| `SOLOCRAWL_BROWSER_ALLOWED` | `true` | Allow Playwright fallback when installed |
| `SOLOCRAWL_LOG_LEVEL` | `WARNING` | Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `SOLOCRAWL_LOG_FILE` | (empty) | Optional log file path (also logs to stderr) |

</details>
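
For readers who want to mirror this "defaults first, environment overrides" pattern in their own scripts, here is a minimal sketch covering a handful of the variables above. The `Settings` dataclass and `settings_from_env` are hypothetical names, not SoloCrawl's real configuration objects, and the set of strings accepted as boolean "true" is an assumption.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Mapping
from dataclasses import dataclass
from typing import Any


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; everything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """A subset of the documented settings, with the documented defaults."""

    respect_robots: bool = True
    cache_ttl_seconds: int = 0
    max_concurrency: int = 10
    per_domain_limit: int = 2
    timeout_seconds: float = 30.0
    max_retries: int = 3
    max_response_bytes: int = 10 * 1024 * 1024  # 10 MiB
    log_level: str = "WARNING"


def settings_from_env(environ: Mapping[str, str] | None = None) -> Settings:
    """Read SOLOCRAWL_* variables, falling back to the defaults when unset or empty."""
    env = os.environ if environ is None else environ
    defaults = Settings()

    def get(name: str, cast: Callable[[str], Any], default: Any) -> Any:
        raw = env.get(f"SOLOCRAWL_{name}", "")
        return cast(raw) if raw.strip() else default

    return Settings(
        respect_robots=get("RESPECT_ROBOTS", _as_bool, defaults.respect_robots),
        cache_ttl_seconds=get("CACHE_TTL_SECONDS", int, defaults.cache_ttl_seconds),
        max_concurrency=get("MAX_CONCURRENCY", int, defaults.max_concurrency),
        per_domain_limit=get("PER_DOMAIN_LIMIT", int, defaults.per_domain_limit),
        timeout_seconds=get("TIMEOUT_SECONDS", float, defaults.timeout_seconds),
        max_retries=get("MAX_RETRIES", int, defaults.max_retries),
        max_response_bytes=get("MAX_RESPONSE_BYTES", int, defaults.max_response_bytes),
        log_level=get("LOG_LEVEL", lambda s: s.strip().upper(), defaults.log_level),
    )
```

Calling `settings_from_env({})` yields the documented defaults, while `settings_from_env({"SOLOCRAWL_MAX_CONCURRENCY": "4"})` overrides a single value and leaves the rest alone.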

🔒 Security note on URL fetching

By default SoloCrawl refuses to fetch localhost, link-local, private, reserved, and cloud-metadata addresses. It checks literal hosts, DNS-resolved A/AAAA records, HTTP redirect targets, and Playwright's final browser URL. SoloCrawl is still a single-user local tool, not a hostile-multi-tenant proxy, so do not expose it to untrusted network callers. `SOLOCRAWL_ALLOW_INTERNAL_URLS=true` disables these internal-target checks entirely (intended for trusted local development only).
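
The literal-host and DNS checks described above could look roughly like this, using only the standard-library `ipaddress` and `socket` modules. It is a sketch under assumptions, not SoloCrawl's code: the function names are made up, and it leaves out the redirect and final-browser-URL checks, which would have to re-run the same test on every hop.

```python
from __future__ import annotations

import ipaddress
import socket
from collections.abc import Callable
from urllib.parse import urlsplit


def is_internal_address(address: str) -> bool:
    """True for loopback, private, link-local (incl. 169.254.169.254), reserved, and similar ranges."""
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop any IPv6 zone id
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to every A/AAAA address it maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def is_safe_url(url: str, resolver: Callable[[str], list[str]] = resolve_host) -> bool:
    """Allow only http(s) URLs whose host is not, and does not resolve to, an internal address."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in {"http", "https"} or not host:
        return False
    try:
        addresses = [str(ipaddress.ip_address(host))]  # host is a literal IP
    except ValueError:
        if host == "localhost" or host.endswith(".localhost"):
            return False
        try:
            addresses = resolver(host)
        except OSError:
            return False  # unresolvable hosts are refused
    # Every resolved address must be public; one internal record is enough to refuse.
    return bool(addresses) and not any(is_internal_address(a) for a in addresses)
```

The `resolver` parameter exists so the DNS step can be replaced with a stub when experimenting, for example `is_safe_url("https://example.test/", resolver=lambda host: ["10.0.0.5"])`.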

🧩 Extending it

The whole point of the plugin layout is that adding a source takes one self-registering file plus a single import line, while the core logic stays untouched. To add a search provider:

1. Create `src/solocrawl/core/search/providers/myprovider.py` implementing `SearchProvider`.
2. Register it with `@register("myprovider", zero_config=True)` or as an opt-in provider.
3. Import the module in `src/solocrawl/core/search/providers/__init__.py` so registration runs.
4. Add fixture-based tests in `tests/`.

The same pattern applies to package providers under src/solocrawl/core/packages/providers/.

Development

Work from a checkout in a virtualenv with an editable install. This also puts the `solocrawl` and `solocrawl-mcp` scripts into `.venv/bin/`:

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Then run the quality gate:

ruff check . && ruff format --check .
pyright
pytest

# …or all in one line:
ruff check . && ruff format --check . && pyright && pytest

Ethics and terms of use

SoloCrawl is built for individual developers and local LLM tooling. It is designed to respect the `robots.txt` and terms of service of target sites: `scrape` consults `robots.txt` and refuses disallowed URLs by default (fail-open on errors; opt out with `SOLOCRAWL_RESPECT_ROBOTS=false`). The proxy and scraping features are not intended to bypass site rules, captchas, or anti-bot systems. Use responsibly and stay within legitimate access patterns.
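
A fail-open `robots.txt` check can be expressed with the standard-library `urllib.robotparser`. The helper below is a sketch, not SoloCrawl's implementation: `allowed_by_robots` is a hypothetical name, the `"SoloCrawl"` user-agent string is an assumption, and fetching the `robots.txt` body is left to the caller, with `None` standing for "could not be fetched".

```python
from __future__ import annotations

from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str | None, url: str, user_agent: str = "SoloCrawl") -> bool:
    """Return whether `url` may be fetched; fail open when robots.txt is unavailable."""
    if robots_txt is None:
        return True  # fail-open: an error fetching robots.txt does not block the scrape
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/private/page"))
    print(allowed_by_robots(rules, "https://example.com/public"))
    print(allowed_by_robots(None, "https://example.com/anything"))
```

Failing open trades strictness for availability: a site whose `robots.txt` times out can still be scraped, which suits a single-user tool but is worth knowing about.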

License

MIT — see LICENSE.
