research-mcp
A stateless MCP server that provides web search, single page reading, and batch page reading tools, with automatic failover across multiple search and content extraction providers.
A stateless MCP facade that hides a pyramid of search/read providers behind a single streamable-http MCP endpoint and exposes just 3 clean tools with good Russian help texts. An LLM gets a simple "search → read" toolset; behind it, several providers are tried, merged, and failed over automatically.
The app does no authentication — it is published through Traefik + basicAuth
on the host. It holds no application state: the only thing persisted is a log
file under data/ (kept on a volume).
Tools
| Tool | What it does |
|---|---|
| `web_search(query, num_results=8, page=1, language=None)` | Search across all enabled providers, merge + dedup → ranked list (title, URL, snippet). Search only. |
| `read_page(url)` | One page or PDF → clean Markdown. Auto-detects type, walks the read pipeline (light → heavy) until one succeeds. |
| `read_pages(urls)` | Up to 20 URLs concurrently → list of `{url, ok, markdown\|error}`. |
Architecture: types + instances
Providers are plugins. We separate:
- type: an implementation class (e.g. the `searxng` search provider), one per module in `src/providers/`, registered with `@register("type")`.
- instance: a configured copy of a type with its secrets/URL resolved from named environment variables (multiple instances of one type are allowed, e.g. `tavily-1` / `tavily-2` with different keys).
Which instances exist and the order each pipeline tries them is configured in
code (src/pipeline_config.py); keys/URLs come from ENV by variable name.
- Search pipeline (`searxng → serper → exa`): enabled instances run concurrently; results are merged and deduplicated by normalized URL (earlier pipeline position wins), then trimmed to `num_results`.
- Read pipeline (`trafilatura → jina → crawl4ai → tavily-1 → tavily-2 → firecrawl`): a single probe GET classifies the URL. PDFs (Content-Type / `.pdf` / `%PDF` magic) are extracted with pypdf; for HTML, that same body is handed to `trafilatura` so the hot path never GETs twice, then the remaining instances are tried in order and the first to return content `>= FALLBACK_MIN_CHARS` wins.
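The merge step of the search pipeline can be sketched roughly as follows. This is an illustration, not the project's code: the `SearchResult` fields are inferred from the "(title, URL, snippet)" description, and `normalize_url` and `merge_results` are hypothetical names with an assumed normalization rule (lowercase scheme and host, drop the fragment and trailing slash). Ranking here is plain pipeline order, which is also an assumption.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class SearchResult:
    # Field names are assumed from the tool description, not taken from base.py.
    title: str
    url: str
    snippet: str


def normalize_url(url: str) -> str:
    """Hypothetical rule: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(per_provider: list[list[SearchResult]], num_results: int) -> list[SearchResult]:
    """Merge result lists given in pipeline order; the first occurrence of a URL wins."""
    seen: set[str] = set()
    merged: list[SearchResult] = []
    for results in per_provider:          # earlier pipeline position comes first
        for result in results:
            key = normalize_url(result.url)
            if key in seen:
                continue                  # duplicate from a later provider: drop it
            seen.add(key)
            merged.append(result)
    return merged[:num_results]           # trim to the requested size
```

Because the provider lists are walked in pipeline order, a URL returned by both `searxng` and `serper` keeps the `searxng` title and snippet.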
Cross-cutting: one transient retry (5xx / transport errors) with a short backoff;
402 (out of credits) / 429 (rate limited) are treated as a provider failure →
next instance (this is what makes tavily-1 → tavily-2 fail over).
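A minimal sketch of the retry and failover rule described above, written synchronously for brevity. The `ProviderError` constructor, the `status` attribute, and the `read_with_failover` helper are assumptions made for this example; only the behavior (one retry on 5xx or transport errors, 402/429 moving straight to the next instance, and a minimum content length) comes from the text.

```python
import time
from typing import Callable, Optional


class ProviderError(Exception):
    """Assumed shape: `status` is the HTTP status, or None for transport errors."""

    def __init__(self, message: str, status: Optional[int] = None):
        super().__init__(message)
        self.status = status


def is_transient(err: ProviderError) -> bool:
    # Transport errors (no status) and 5xx are retried; 402/429 are not.
    return err.status is None or 500 <= err.status < 600


def read_with_failover(
    url: str,
    instances: list[tuple[str, Callable[[str], str]]],
    min_chars: int,
    retries: int = 1,
    backoff: float = 0.2,
) -> tuple[str, str]:
    """Return (instance_name, markdown) from the first instance with enough content."""
    for name, read in instances:
        for attempt in range(retries + 1):
            try:
                markdown = read(url)
            except ProviderError as err:
                if is_transient(err) and attempt < retries:
                    time.sleep(backoff)   # short pause before the single retry
                    continue
                break                     # 402/429 or retries exhausted: next instance
            if len(markdown) >= min_chars:
                return name, markdown
            break                         # too little content: next instance
    raise ProviderError(f"all read instances failed for {url}")
```

With `tavily-1` out of credits (402), the loop skips it without retrying and moves on to `tavily-2`, which still gets its one transient retry.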
An instance is enabled only if its required env var(s) are set; otherwise it
is skipped with a log line. trafilatura needs no config (always on); jina
works keyless (its key is optional). At startup the server requires at least one
search and one read instance, else it exits with a clear message.
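The enablement rule could look something like the sketch below. `Instance("name", "<type>", api_key_env=...)` follows the form shown in this README; the `url_env` and `key_optional` fields and the helper functions are hypothetical and only illustrate the idea of resolving configuration by env var name.

```python
import logging
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

log = logging.getLogger("research-mcp")


@dataclass(frozen=True)
class Instance:
    name: str
    type: str
    api_key_env: Optional[str] = None  # NAME of the env var, never its value
    url_env: Optional[str] = None      # hypothetical: for URL-configured providers
    key_optional: bool = False         # hypothetical: e.g. jina works keyless


def is_enabled(inst: Instance, env: Mapping[str, str]) -> bool:
    required = [inst.url_env]
    if not inst.key_optional:
        required.append(inst.api_key_env)
    # Instances with no required vars (e.g. trafilatura) are always enabled.
    return all(env.get(var) for var in required if var)


def enabled_instances(
    pipeline: Sequence[Instance], env: Mapping[str, str] = os.environ
) -> list[Instance]:
    enabled = []
    for inst in pipeline:
        if is_enabled(inst, env):
            enabled.append(inst)
        else:
            log.info("skipping %s: required env var not set", inst.name)
    return enabled


def require_pipelines(search: Sequence[Instance], read: Sequence[Instance]) -> None:
    # Fail fast at startup instead of serving tools that cannot work.
    if not search or not read:
        raise SystemExit("need at least one enabled search and one enabled read instance")
```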
Adding a provider
- Write `src/providers/<type>.py` with a class decorated `@register("<type>")` implementing `SearchProvider.search(...)` or `ReadProvider.read(...)`.
- Import the module in `src/providers/__init__.py` (so the decorator runs).
- Add an `Instance("name", "<type>", api_key_env="YOUR_ENV_NAME")` line in `src/pipeline_config.py` and reference its `name` in `SEARCH_PIPELINE` / `READ_PIPELINE`. Use the ENV var NAME, never a value.
- Document the env var in `.env.example`.
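To make the registration step concrete, here is a self-contained sketch of how a `@register` decorator and a new provider type might fit together. The names `register`, `REGISTRY`, and `SearchProvider` appear in this README, but their bodies below are assumptions, and `ExampleSearch` is an invented provider that returns canned data. The real interfaces may well be async and return `SearchResult` objects.

```python
from typing import Callable, Dict, List, Optional

REGISTRY: Dict[str, type] = {}


def register(type_name: str) -> Callable[[type], type]:
    """Class decorator: map a type name to its implementation class."""
    def decorator(cls: type) -> type:
        if type_name in REGISTRY:
            raise ValueError(f"provider type {type_name!r} already registered")
        REGISTRY[type_name] = cls
        return cls
    return decorator


class SearchProvider:
    """Stand-in for the interface in src/providers/base.py (signature assumed)."""

    def search(self, query: str, num_results: int = 8, page: int = 1,
               language: Optional[str] = None) -> List[dict]:
        raise NotImplementedError


@register("example")
class ExampleSearch(SearchProvider):
    """Hypothetical provider type; a real one would call its API here."""

    def __init__(self, api_key: str):
        self.api_key = api_key  # resolved from the env var named in the Instance

    def search(self, query, num_results=8, page=1, language=None):
        return [
            {"title": f"{query} #{i}", "url": f"https://example.com/{i}", "snippet": ""}
            for i in range(num_results)
        ]
```

Registration happens as a side effect of importing the module, which is why the import in `src/providers/__init__.py` is required.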
Quick start
make install # create .venv + install dev/test deps
cp .env.example .env # fill in the keys you have (shortcut: make env)
make test # run tests
make run # run the server (streamable-http on MCP_HOST:MCP_PORT, endpoint /mcp)
Configuration
All config comes from ENV / .env (see .env.example). Provider secrets/URLs
are read by name in the instance loader, not declared as Settings fields. The
non-secret knobs (all defaulted): MCP_HOST, MCP_PORT, LOG_LEVEL,
LOG_FILE, LOG_ROTATION, LOG_RETENTION, REQUEST_TIMEOUT,
FALLBACK_MIN_CHARS, READ_PAGES_CONCURRENCY, RETRIES. The read_pages
per-call url cap is a fixed 20 (hard constant, matching the tool description) —
not configurable.
Provider env vars: SEARXNG_URL, SERPER_API_KEY, EXA_API_KEY, JINA_API_KEY
(optional), CRAWL4AI_URL + CRAWL4AI_TOKEN, TAVILY_1_API_KEY,
TAVILY_2_API_KEY, FIRECRAWL_API_KEY.
Proxy
Any external instance can be routed through its own SOCKS5/HTTP proxy by
setting <INSTANCE>_PROXY — useful for clean egress past IP-based blocks (e.g.
Cloudflare in front of Exa). Supported per instance: EXA_PROXY, SERPER_PROXY,
JINA_PROXY, TAVILY_1_PROXY, TAVILY_2_PROXY, FIRECRAWL_PROXY. Internal
instances (searxng, crawl4ai, trafilatura) have no proxy.
The value is passed straight to httpx; socks5://host:port does proxy-side
DNS (the target hostname is resolved by the proxy, like curl --socks5-hostname), and socks5h:// / http://host:port are also accepted.
Unset → that instance goes direct. The pipeline keeps one pooled httpx client
per distinct proxy URL (and one direct client), selected per instance, so
proxied and direct providers run side by side. Needs the socks extra
(httpx[socks], already pinned).
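The "one pooled client per distinct proxy URL" idea can be sketched without any HTTP library by injecting a client factory. `ClientPool` and `for_instance` are hypothetical names; the only detail taken from the text is the `<INSTANCE>_PROXY` naming (for example `tavily-1` maps to `TAVILY_1_PROXY`). With httpx, the factory could be something like `lambda proxy: httpx.AsyncClient(proxy=proxy)` on versions that accept the `proxy` argument.

```python
from typing import Callable, Dict, Generic, Mapping, Optional, TypeVar

C = TypeVar("C")


class ClientPool(Generic[C]):
    """Cache one client per distinct proxy URL; the key None is the direct client."""

    def __init__(self, factory: Callable[[Optional[str]], C]):
        self._factory = factory
        self._clients: Dict[Optional[str], C] = {}

    def for_instance(self, instance_name: str, env: Mapping[str, str]) -> C:
        # "tavily-1" -> "TAVILY_1_PROXY"; an unset or empty value means direct.
        var = instance_name.upper().replace("-", "_") + "_PROXY"
        proxy = env.get(var) or None
        if proxy not in self._clients:
            self._clients[proxy] = self._factory(proxy)
        return self._clients[proxy]
```

Two instances that point at the same proxy URL share a connection pool, while instances without a proxy variable all share the single direct client.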
Logging
Besides stderr (captured by Docker's rotation-capped json-file driver), the
server writes a persistent log file to data/research-mcp.log (default;
LOG_ROTATION=20 MB, LOG_RETENTION=14 days). It lives on the data/ volume,
so it survives container restarts and image updates. The file carries one
per-request line per tool call — search (query, which provider instances
actually ran, result count, latency) and read (url, the winning provider/tier
or pdf, ok, latency), plus a read_pages count=N ok=K summary — making it
useful for analyzing how requests distribute across provider tiers. No request
bodies or secrets are logged, only urls/queries, provider names, counts, timings.
Deployment
CI builds the image and pushes it to ghcr.io (test → build, tags latest +
sha). On prod we pull the prebuilt image via docker-compose.yml (behind
Traefik + basicAuth, watchtower auto-updates latest; the data/ volume keeps
the log file across updates) — we never build on prod.
Layout
| Path | Purpose |
|---|---|
| `src/providers/base.py` | Provider interfaces + `SearchResult` / `ProviderError`. |
| `src/providers/registry.py` | `@register` decorator → `REGISTRY`. |
| `src/providers/<type>.py` | One module per provider type. |
| `src/providers/pdf.py` | PDF detection + pypdf text extraction (used by the pipeline). |
| `src/pipeline_config.py` | In-code instances + pipeline order. |
| `src/pipeline.py` | Instance loader + search/read logic. |
| `src/settings.py` | Non-secret knobs (pydantic-settings). |
| `src/server.py` | `build_server()` with the 3 `@mcp.tool` definitions. |
| `main.py` | Thin entry point: build server, run streamable-http. |
| `tests/` | pytest suite (network mocked with respx). |