research-mcp

A stateless MCP server that provides web search, single page reading, and batch page reading tools, with automatic failover across multiple search and content extraction providers.

A stateless MCP facade that hides a pyramid of search/read providers behind a single streamable-http MCP endpoint and exposes just 3 clean tools with good Russian help texts. An LLM gets a simple "search → read" toolset; behind it, several providers are tried, merged, and failed over automatically.

The app performs no authentication of its own: it is published through Traefik + basicAuth on the host. It holds no application state; the only thing persisted is a log file under data/ (kept on a volume).

Tools

| Tool | What it does |
| --- | --- |
| web_search(query, num_results=8, page=1, language=None) | Search across all enabled providers, merge + dedup → ranked list (title, URL, snippet). Search only. |
| read_page(url) | One page or PDF → clean Markdown. Auto-detects type, walks the read pipeline (light → heavy) until one succeeds. |
| read_pages(urls) | Up to 20 URLs concurrently → list of {url, ok, markdown or error}. |
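
To illustrate the shape of a read_pages result, here is a minimal sketch of a bounded-concurrency batch read. It is not the server's code: the `read_page` callable passed in, the default `concurrency` of 5, and the choice to reject more than 20 URLs (rather than truncate) are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable

MAX_URLS = 20  # fixed per-call cap stated in the tool description


async def read_pages(
    urls: list[str],
    read_page: Callable[[str], Awaitable[str]],
    concurrency: int = 5,  # assumed value; the real knob is READ_PAGES_CONCURRENCY
) -> list[dict]:
    """Read several URLs concurrently; one failure never aborts the batch."""
    if len(urls) > MAX_URLS:
        # Assumption: the sketch rejects oversized batches; the server may differ.
        raise ValueError(f"at most {MAX_URLS} urls per call")
    gate = asyncio.Semaphore(concurrency)

    async def one(url: str) -> dict:
        async with gate:
            try:
                return {"url": url, "ok": True, "markdown": await read_page(url)}
            except Exception as exc:  # report per-URL errors instead of raising
                return {"url": url, "ok": False, "error": str(exc)}

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(one(u) for u in urls))
```

Each entry carries either `markdown` or `error`, so a caller can filter on `ok` without catching exceptions.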

Architecture: types + instances

Providers are plugins. We separate:

  • type — an implementation class (e.g. the searxng search provider), one per module in src/providers/, registered with @register("type").
  • instance — a configured copy of a type with its secrets/URL resolved from named environment variables (multiple instances of one type are allowed, e.g. tavily-1 / tavily-2 with different keys).

Which instances exist and the order each pipeline tries them is configured in code (src/pipeline_config.py); keys/URLs come from ENV by variable name.
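
The type/instance split can be sketched roughly as follows. Only the names `register`, `REGISTRY`, `Instance`, `api_key_env` and `READ_PIPELINE` come from this README; the `TavilyReader` class, the `build` helper, and the constructor signature are hypothetical stand-ins, and the real code in src/ may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

REGISTRY: dict[str, type] = {}


def register(type_name: str) -> Callable[[type], type]:
    """Class decorator: make a provider type discoverable by name."""
    def wrap(cls: type) -> type:
        REGISTRY[type_name] = cls
        return cls
    return wrap


@dataclass(frozen=True)
class Instance:
    name: str                          # e.g. "tavily-1"
    type: str                          # key into REGISTRY
    api_key_env: Optional[str] = None  # ENV var NAME, never the value


@register("tavily")
class TavilyReader:  # hypothetical stand-in for a real provider type
    def __init__(self, api_key: Optional[str]) -> None:
        self.api_key = api_key


# Two instances of one type, each pointing at a different env var name.
READ_PIPELINE = [
    Instance("tavily-1", "tavily", api_key_env="TAVILY_1_API_KEY"),
    Instance("tavily-2", "tavily", api_key_env="TAVILY_2_API_KEY"),
]


def build(inst: Instance, env: Mapping[str, str] = os.environ):
    """Resolve the secret by variable name, then instantiate the type."""
    key = env[inst.api_key_env] if inst.api_key_env else None
    return REGISTRY[inst.type](key)
```

Keeping only the variable name in the config means the pipeline definition can live in version control without ever containing a secret.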

  • Search pipeline (searxng → serper → exa): enabled instances run concurrently; results are merged and deduplicated by normalized URL (earlier pipeline position wins), then trimmed to num_results.
  • Read pipeline (trafilatura → jina → crawl4ai → tavily-1 → tavily-2 → firecrawl): a single probe GET classifies the url. PDFs (Content-Type / .pdf / %PDF magic) are extracted with pypdf; for HTML, that same body is handed to trafilatura so the hot path never GETs twice, then the remaining instances are tried in order and the first to return content >= FALLBACK_MIN_CHARS wins.
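
The merge step of the search pipeline might look like the sketch below. The normalization rules (lowercase scheme and host, drop fragment and trailing slash) and the simple concatenation in pipeline order are assumptions; the README does not specify how the server normalizes URLs or ranks merged results.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def normalize_url(url: str) -> str:
    """Assumed rules: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(
    per_provider: list[list[SearchResult]], num_results: int
) -> list[SearchResult]:
    """per_provider is ordered by pipeline position; the first hit for a URL wins."""
    seen: set[str] = set()
    merged: list[SearchResult] = []
    for results in per_provider:
        for r in results:
            key = normalize_url(r.url)
            if key not in seen:
                seen.add(key)
                merged.append(r)
    return merged[:num_results]
```

Because providers are iterated in pipeline order, a duplicate from a later provider is dropped and the earlier provider's title and snippet are kept.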

Cross-cutting: one transient retry (5xx / transport errors) with a short backoff; 402 (out of credits) / 429 (rate limited) are treated as a provider failure → next instance (this is what makes tavily-1 → tavily-2 fail over).
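
The retry and failover rules above can be sketched as follows. This is a synchronous simplification under stated assumptions: the `status` attribute on `ProviderError`, the helper names, and the default `backoff` and `min_chars` values are invented for the example (the real threshold is FALLBACK_MIN_CHARS, and the real pipeline is likely asynchronous).

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """status is the HTTP code, or None for transport errors (assumed shape)."""

    def __init__(self, message: str, status: Optional[int] = None) -> None:
        super().__init__(message)
        self.status = status


def is_transient(err: ProviderError) -> bool:
    # 5xx and transport errors are retried; 402 / 429 and other 4xx are not
    return err.status is None or 500 <= err.status < 600


def call_with_retry(fn: Callable[[], str], retries: int = 1, backoff: float = 0.2) -> str:
    for attempt in range(retries + 1):
        try:
            return fn()
        except ProviderError as err:
            if attempt == retries or not is_transient(err):
                raise
            time.sleep(backoff)  # short pause before the single retry
    raise AssertionError("unreachable")


def read_with_failover(
    url: str,
    instances: Sequence[tuple[str, Callable[[str], str]]],
    min_chars: int = 200,  # assumed default for FALLBACK_MIN_CHARS
) -> tuple[str, str]:
    """instances: ordered (name, read_fn) pairs; returns (winner_name, content)."""
    for name, read_fn in instances:
        try:
            content = call_with_retry(lambda: read_fn(url))
        except ProviderError:
            continue  # includes 402 / 429: move on to the next instance
        if len(content) >= min_chars:
            return name, content
    raise ProviderError(f"all read providers failed for {url}")
```

In this sketch an out-of-credits tavily-1 raises a 402, which is not retried, so the loop falls straight through to the next instance in the pipeline.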

An instance is enabled only if its required env var(s) are set; otherwise it is skipped with a log line. trafilatura needs no config (always on); jina works keyless (its key is optional). At startup the server requires at least one search and one read instance, else it exits with a clear message.
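
A rough sketch of the enablement check, assuming each instance is described by a name plus the env var names it requires. The tuple layout, the treatment of empty strings as unset, and the idea that crawl4ai needs both of its variables are assumptions for illustration only.

```python
import sys
from typing import Mapping, Sequence

# (instance name, required env var names); jina's key is optional, so none required
SEARCH = [
    ("searxng", ("SEARXNG_URL",)),
    ("serper", ("SERPER_API_KEY",)),
    ("exa", ("EXA_API_KEY",)),
]
READ = [
    ("trafilatura", ()),
    ("jina", ()),
    ("crawl4ai", ("CRAWL4AI_URL", "CRAWL4AI_TOKEN")),  # assumed: both required
    ("tavily-1", ("TAVILY_1_API_KEY",)),
]


def enabled(
    instances: Sequence[tuple[str, tuple[str, ...]]], env: Mapping[str, str]
) -> list[str]:
    """Keep instances whose required vars are all set and non-empty."""
    kept = []
    for name, required in instances:
        missing = [v for v in required if not env.get(v)]
        if missing:
            print(f"skip {name}: missing {', '.join(missing)}", file=sys.stderr)
        else:
            kept.append(name)
    return kept


def check_startup(search: list[str], read: list[str]) -> None:
    """Exit with a clear message unless both pipelines have an instance."""
    if not search or not read:
        raise SystemExit("need at least one enabled search and one read instance")
```

Note that with trafilatura always on, the read side can never be empty in practice; the startup check mostly guards against a deployment with no search keys at all.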

Adding a provider

  1. Write src/providers/<type>.py with a class decorated @register("<type>") implementing SearchProvider.search(...) or ReadProvider.read(...).
  2. Import the module in src/providers/__init__.py (so the decorator runs).
  3. Add an Instance("name", "<type>", api_key_env="YOUR_ENV_NAME") line in src/pipeline_config.py and reference its name in SEARCH_PIPELINE / READ_PIPELINE. Use the ENV var NAME, never a value.
  4. Document the env var in .env.example.

Quick start

make install                # create .venv + install dev/test deps
cp .env.example .env        # fill in the keys you have  (shortcut: make env)
make test                   # run tests
make run                    # run the server (streamable-http on MCP_HOST:MCP_PORT, endpoint /mcp)

Configuration

All config comes from ENV / .env (see .env.example). Provider secrets/URLs are read by name in the instance loader, not declared as Settings fields. The non-secret knobs (all defaulted): MCP_HOST, MCP_PORT, LOG_LEVEL, LOG_FILE, LOG_ROTATION, LOG_RETENTION, REQUEST_TIMEOUT, FALLBACK_MIN_CHARS, READ_PAGES_CONCURRENCY, RETRIES. The read_pages per-call url cap is a fixed 20 (hard constant, matching the tool description) — not configurable.

Provider env vars: SEARXNG_URL, SERPER_API_KEY, EXA_API_KEY, JINA_API_KEY (optional), CRAWL4AI_URL + CRAWL4AI_TOKEN, TAVILY_1_API_KEY, TAVILY_2_API_KEY, FIRECRAWL_API_KEY.

Proxy

Any external instance can be routed through its own SOCKS5/HTTP proxy by setting <INSTANCE>_PROXY — useful for clean egress past IP-based blocks (e.g. Cloudflare in front of Exa). Supported per instance: EXA_PROXY, SERPER_PROXY, JINA_PROXY, TAVILY_1_PROXY, TAVILY_2_PROXY, FIRECRAWL_PROXY. Internal instances (searxng, crawl4ai, trafilatura) have no proxy.

The value is passed straight to httpx; socks5://host:port does proxy-side DNS (the target hostname is resolved by the proxy, like curl --socks5-hostname), and socks5h:// / http://host:port are also accepted. Unset → that instance goes direct. The pipeline keeps one pooled httpx client per distinct proxy URL (and one direct client), selected per instance, so proxied and direct providers run side by side. Needs the socks extra (httpx[socks], already pinned).
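
The "one pooled client per distinct proxy URL" idea can be sketched without depending on httpx by injecting a client factory. The `ClientPool` class and its method name are hypothetical, and deriving the variable name by upper-casing the instance name and replacing hyphens is an assumption consistent with the names listed above (tavily-1 → TAVILY_1_PROXY).

```python
from typing import Callable, Generic, Mapping, Optional, TypeVar

C = TypeVar("C")


class ClientPool(Generic[C]):
    """One client per distinct proxy URL; the key None is the direct client."""

    def __init__(self, factory: Callable[[Optional[str]], C]) -> None:
        self._factory = factory
        self._clients: dict[Optional[str], C] = {}

    def for_instance(self, name: str, env: Mapping[str, str]) -> C:
        # "tavily-1" -> "TAVILY_1_PROXY"
        var = name.upper().replace("-", "_") + "_PROXY"
        proxy = env.get(var) or None  # unset or empty means direct
        if proxy not in self._clients:
            self._clients[proxy] = self._factory(proxy)
        return self._clients[proxy]
```

With a recent httpx, the factory could plausibly be `lambda proxy: httpx.AsyncClient(proxy=proxy)`, so two instances that share a proxy URL also share one connection pool.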

Logging

Besides stderr (captured by Docker's rotation-capped json-file driver), the server writes a persistent log file to data/research-mcp.log (default; LOG_ROTATION=20 MB, LOG_RETENTION=14 days). It lives on the data/ volume, so it survives container restarts and image updates. The file carries one per-request line per tool call — search (query, which provider instances actually ran, result count, latency) and read (url, the winning provider/tier or pdf, ok, latency), plus a read_pages count=N ok=K summary — making it useful for analyzing how requests distribute across provider tiers. No request bodies or secrets are logged, only urls/queries, provider names, counts, timings.

Deployment

CI builds the image and pushes it to ghcr.io (testbuild, tags latest + sha). On prod we pull the prebuilt image via docker-compose.yml (behind Traefik + basicAuth, watchtower auto-updates latest; the data/ volume keeps the log file across updates) — we never build on prod.

Layout

| Path | Purpose |
| --- | --- |
| src/providers/base.py | Provider interfaces + SearchResult / ProviderError. |
| src/providers/registry.py | @register decorator → REGISTRY. |
| src/providers/<type>.py | One module per provider type. |
| src/providers/pdf.py | PDF detection + pypdf text extraction (used by the pipeline). |
| src/pipeline_config.py | In-code instances + pipeline order. |
| src/pipeline.py | Instance loader + search/read logic. |
| src/settings.py | Non-secret knobs (pydantic-settings). |
| src/server.py | build_server() with the 3 @mcp.tool definitions. |
| main.py | Thin entry point: build server, run streamable-http. |
| tests/ | pytest suite (network mocked with respx). |
