mcp-ddg-research

Lightweight MCP server for DuckDuckGo search with HTML fallback, safe webpage fetching, caching, and clean text extraction.

mcp-ddg-research is a self-hosted Python MCP server that exposes deterministic research primitives to MCP clients. It can run DuckDuckGo searches, fall back to DuckDuckGo's lightweight HTML endpoint when the ddgs provider fails, fetch webpages with SSRF protections, cache search/fetch responses, deduplicate URLs, and extract readable text from HTML pages.

The MCP client or agent is responsible for reasoning over the returned data. This server only returns structured search results and fetched page text.

What This Project Does

  • Searches DuckDuckGo through ddgs.DDGS().text(...).
  • Falls back to https://html.duckduckgo.com/html/ when ddgs times out, is rate limited, raises an exception, or returns no results.
  • Parses DuckDuckGo HTML fallback results with BeautifulSoup.
  • Resolves DuckDuckGo redirect URLs such as /l/?uddg=....
  • Deduplicates normalized result URLs.
  • Fetches webpages with strict URL and DNS safety checks.
  • Follows redirects manually and validates every redirect target.
  • Extracts clean text from HTML by removing script, style, navigation, footer, and similar boilerplate.
  • Caches search and fetch responses in a file-based JSON cache.
  • Provides a simple deep search tool that searches once and fetches top result pages concurrently.
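The redirect-resolution step in the list above can be sketched with the standard library. This is a minimal illustration, not the project's code: the function name resolve_ddg_redirect and the DDG_BASE constant are hypothetical, and the sketch assumes the target URL is carried percent-encoded in the uddg query parameter.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Assumed base used to absolutize relative links such as "/l/?uddg=...".
DDG_BASE = "https://duckduckgo.com"


def resolve_ddg_redirect(href: str) -> str:
    """Return the target hidden in a DuckDuckGo /l/?uddg=... link, else href."""
    absolute = urljoin(DDG_BASE, href)  # also handles "//duckduckgo.com/l/?..."
    parsed = urlparse(absolute)
    host = parsed.hostname or ""
    is_ddg = host == "duckduckgo.com" or host.endswith(".duckduckgo.com")
    if is_ddg and parsed.path.rstrip("/") == "/l":
        targets = parse_qs(parsed.query).get("uddg")
        if targets:
            return targets[0]  # parse_qs has already percent-decoded the value
    return href  # not a redirect link: leave it untouched
```

Links that are not DuckDuckGo redirects pass through unchanged, so the helper can be applied to every parsed result before deduplication.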

What This Project Does Not Do

  • No LLM integration.
  • No summarization.
  • No report generation.
  • No browser automation.
  • No proxy rotation.
  • No captcha bypassing.
  • No ranking with model endpoints.
  • No OpenAI, Anthropic, Ollama, LM Studio, or other model endpoint support.

Why HTML Fallback Exists

The ddgs package is the preferred provider because it offers a simple Python API and handles DuckDuckGo search details for normal use. Search providers can still fail because of network timeouts, temporary provider errors, rate limits, empty responses, dependency import problems, or upstream behavior changes.

When that happens, this server falls back to DuckDuckGo's lightweight HTML endpoint. The fallback uses conservative request defaults, browser-like headers, and BeautifulSoup selectors for .result, .result__a, and .result__snippet.
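The control flow described here can be sketched as follows. It is a simplified illustration under stated assumptions: search_with_fallback is a hypothetical name, the two providers are injected as plain callables, and the "html" provider label and the error strings are invented for the example (only the "ddgs" label appears in the response examples in this README).

```python
from typing import Callable

# A provider takes (query, max_results) and returns a list of result dicts.
SearchFn = Callable[[str, int], list[dict]]


def search_with_fallback(
    query: str, max_results: int, primary: SearchFn, fallback: SearchFn
) -> dict:
    """Try the primary provider; use the fallback on any error or empty result."""
    try:
        results = primary(query, max_results)
        if results:
            return {"query": query, "provider": "ddgs", "results": results, "error": None}
        primary_error = "primary provider returned no results"
    except Exception as exc:  # timeouts, rate limits, import problems, ...
        primary_error = f"{type(exc).__name__}: {exc}"

    try:
        results = fallback(query, max_results)
    except Exception as exc:
        # Both providers failed: return a structured error instead of raising.
        error = f"{primary_error}; fallback failed: {type(exc).__name__}: {exc}"
        return {"query": query, "provider": "html", "results": [], "error": error}
    # "html" is an assumed label for the fallback provider.
    return {"query": query, "provider": "html", "results": results, "error": None}
```

Returning a structured error rather than raising keeps the tool response shape stable for the MCP client.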

Available MCP Tools

ddg_search

Search DuckDuckGo and return structured results.

Arguments:

{
  "query": "python mcp server fastmcp",
  "max_results": 10,
  "search_window": null,
  "safe_search": "off",
  "time_filter": "month",
  "blocked_domains": [],
  "allowed_domains": [],
  "preferred_domains": []
}

Argument rules:

  • query: string, required.
  • max_results: integer, default 10, minimum 1, maximum 30.
  • search_window: optional integer, minimum 1, maximum 100. If provided, this is the provider request size before dedupe/domain controls/final cap.
  • safe_search: one of off, moderate, strict, default off.
  • time_filter: optional, one of day, week, month, year.
  • blocked_domains: optional list of domains to remove from results, default [].
  • allowed_domains: optional list of domains to keep, default [].
  • preferred_domains: optional list of domains to move earlier while preserving stable order, default [].

Response example:

{
  "query": "python mcp server fastmcp",
  "provider": "ddgs",
  "results": [
    {
      "title": "MCP Python SDK",
      "url": "https://github.com/modelcontextprotocol/python-sdk",
      "snippet": "Python SDK for Model Context Protocol servers and clients."
    }
  ],
  "cached": false,
  "error": null
}

web_fetch

Fetch a single webpage and return clean text.

Arguments:

{
  "url": "https://example.com/article",
  "max_chars": 12000
}

Argument rules:

  • url: HTTP or HTTPS URL.
  • max_chars: integer, default 12000, minimum 1000, maximum 50000.

Response example:

{
  "url": "https://example.com/article",
  "final_url": "https://example.com/article",
  "title": "Example Article",
  "content": "Readable extracted page text...",
  "content_type": "text/html; charset=utf-8",
  "cached": false,
  "success": true,
  "error": null
}

ddg_deep_search

Search once, fetch top result pages concurrently, and return sources plus page content.

Arguments:

{
  "query": "model context protocol python sdk",
  "max_results": 10,
  "search_window": null,
  "max_pages": 5,
  "max_chars_per_page": 12000,
  "safe_search": "off",
  "time_filter": "year",
  "blocked_domains": [],
  "allowed_domains": [],
  "preferred_domains": [],
  "max_concurrency": null
}

Argument rules:

  • query: string, required.
  • max_results: integer, default 10, minimum 1, maximum 30.
  • search_window: optional integer, minimum 1, maximum 100. Passed through to ddg_search as the provider request size before final result capping.
  • max_pages: integer, default 5, minimum 1, maximum 10.
  • max_chars_per_page: integer, default 12000, minimum 1000, maximum 50000.
  • safe_search: one of off, moderate, strict, default off.
  • time_filter: optional, one of day, week, month, year.
  • blocked_domains: optional list of domains to remove from search results before fetching, default [].
  • allowed_domains: optional list of domains to keep before fetching, default [].
  • preferred_domains: optional list of domains to move earlier before fetching, default [].
  • max_concurrency: optional per-call page fetch concurrency, minimum 1, maximum 12. If omitted, MAX_CONCURRENCY is used.

Response example:

{
  "query": "model context protocol python sdk",
  "search_provider": "ddgs",
  "sources": [
    {
      "title": "MCP Python SDK",
      "url": "https://github.com/modelcontextprotocol/python-sdk",
      "snippet": "Python SDK for Model Context Protocol servers and clients."
    }
  ],
  "pages": [
    {
      "title": "MCP Python SDK",
      "url": "https://github.com/modelcontextprotocol/python-sdk",
      "final_url": "https://github.com/modelcontextprotocol/python-sdk",
      "content": "Extracted page text..."
    }
  ],
  "failed_pages": [],
  "cached": false
}
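One common way to bound concurrent page fetches is an asyncio.Semaphore. The sketch below is an assumption about how such a limit could be applied, not the server's actual implementation: fetch_pages and the injected fetch_one coroutine are hypothetical names, and the split into successful and failed pages mirrors the pages and failed_pages fields shown above.

```python
import asyncio
from typing import Awaitable, Callable

FetchFn = Callable[[str], Awaitable[dict]]


async def fetch_pages(
    urls: list[str], fetch_one: FetchFn, max_concurrency: int = 5
) -> tuple[list[dict], list[dict]]:
    """Fetch urls concurrently, never running more than the limit at once."""
    limit = max(1, min(max_concurrency, 12))  # documented bounds: 1 to 12
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> dict:
        async with semaphore:
            try:
                return await fetch_one(url)
            except Exception as exc:
                # One failing page must not abort the whole batch.
                return {"url": url, "success": False, "error": str(exc)}

    # gather preserves input order, so pages stay in search-result order.
    outcomes = await asyncio.gather(*(guarded(url) for url in urls))
    pages = [o for o in outcomes if o.get("success")]
    failed = [o for o in outcomes if not o.get("success")]
    return pages, failed
```

A caller would run it with asyncio.run(fetch_pages(urls, fetch_one, max_concurrency=2)), where fetch_one returns a dict containing a truthy "success" key on success.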

Domain Controls

Domain controls are opt-in. If you do not pass blocked_domains, allowed_domains, preferred_domains, or search_window, ddg_search requests exactly max_results from DuckDuckGo and preserves DuckDuckGo's default ranking order after URL deduplication. The server does not apply a built-in source bias, source boost, or domain blocklist.

When any domain control is provided, the server requests a larger internal window from the provider before applying dedupe and domain controls. The default window is:

min(max_results * 3, 50)

The final response is still capped to max_results. You can override the provider request size with search_window, minimum 1, maximum 100. This is useful when a desired allowed/preferred domain might appear outside the first max_results provider results.

Domain inputs are normalized by lowercasing, removing URL schemes, removing paths and query strings, and stripping a leading "www." prefix. Matching supports exact domains and subdomains. For example, docs.example.com matches example.com, but example.com.evil.com does not.
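A minimal sketch of this normalization and matching, using only the standard library. The names normalize_domain and domain_matches are hypothetical, and dropping the port is an extra assumption that the text above does not state.

```python
from urllib.parse import urlsplit


def normalize_domain(value: str) -> str:
    """Reduce a domain or URL to a bare lowercase host without a leading www."""
    text = value.strip().lower()
    if "://" not in text:
        text = "//" + text  # make urlsplit treat the input as a network location
    host = urlsplit(text).hostname or ""  # drops scheme, path, query, and port
    return host.removeprefix("www.")


def domain_matches(host: str, domain: str) -> bool:
    """True when host equals domain or is one of its subdomains."""
    host = normalize_domain(host)
    domain = normalize_domain(domain)
    if not host or not domain:
        return False
    # The leading dot prevents suffix tricks such as "notexample.com".
    return host == domain or host.endswith("." + domain)
```

Comparing against "." + domain rather than the bare suffix is what makes example.com.evil.com fail to match example.com.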

Filtering order:

  1. Apply allowed_domains if provided.
  2. Apply blocked_domains if provided.
  3. Apply preferred_domains if provided.

preferred_domains performs a stable partition: preferred matches move earlier, relative order is preserved inside the preferred and non-preferred groups, and no numeric score is invented.

Block domains:

{
  "query": "self hosted photo backup",
  "blocked_domains": ["example.com", "old-docs.example.org"]
}

Allow only specific domains:

{
  "query": "python mcp server",
  "allowed_domains": ["github.com", "modelcontextprotocol.io"]
}

Prefer domains without excluding others:

{
  "query": "duckduckgo html search endpoint",
  "preferred_domains": ["duckduckgo.com", "github.com"]
}

Search a larger internal window before applying domain controls:

{
  "query": "python mcp server",
  "max_results": 10,
  "search_window": 40,
  "allowed_domains": ["github.com", "modelcontextprotocol.io"]
}

Deep search with the same search window behavior:

{
  "query": "model context protocol python sdk",
  "max_results": 10,
  "max_pages": 5,
  "search_window": 40,
  "preferred_domains": ["github.com"]
}

Limit deep-search fetch concurrency for one call:

{
  "query": "model context protocol python sdk",
  "max_pages": 5,
  "max_concurrency": 2
}

Docker Stdio Usage

Build the local image:

docker build -t mcp-ddg-research:local .

Run the server over stdio. This mode is auth-free because the MCP client owns stdin/stdout and there is no listening network socket:

docker run --rm -i -v "$PWD/data:/data" mcp-ddg-research:local

Docker Stdio MCP Client Configuration

{
  "mcpServers": {
    "ddg-research": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-v",
        "/opt/mcp-ddg-research/data:/data",
        "mcp-ddg-research:local"
      ]
    }
  }
}

docker-compose Usage

The included compose file starts the server in streamable HTTP mode on /mcp. It maps host port 49317 to container port 8000 and requires Authorization: Bearer change-me-now by default.

Build and start the service:

docker compose up --build ddg-research

The compose file persists cache data at:

~/docker/docker-data/mcp-ddg-research/cache

The checked-in compose token is the placeholder change-me-now. It is acceptable for local smoke tests only. Replace MCP_AUTH_TOKEN in docker-compose.yml before using LAN, VPN, reverse-proxy, or Cloudflare Tunnel deployments.

The compose file defaults MCP_ALLOWED_HOSTS=* and MCP_ALLOWED_ORIGINS=* so the same container can run behind a LAN IP, hostname, domain, reverse proxy, or HTTPS endpoint. In MCP SDK 1.27.2, wildcard Host/Origin validation is not supported by the DNS rebinding middleware, so wildcard mode disables the SDK Host/Origin allowlist and relies on the bearer token. To enable strict Host/Origin checks, set exact comma-separated values such as:

MCP_ALLOWED_HOSTS="example.com,example.com:443,localhost:49317"
MCP_ALLOWED_ORIGINS="https://example.com,http://localhost:*"

LAN HTTP Example

Set a real token in docker-compose.yml and start the server:

docker compose up -d --build

Use your server's LAN IP in the client URL:

http://YOUR_SERVER_IP:49317/mcp

OpenCode remote MCP configuration for a LAN deployment:

{
  "mcp": {
    "ddg-research": {
      "type": "remote",
      "enabled": true,
      "url": "http://YOUR_SERVER_IP:49317/mcp",
      "oauth": false,
      "headers": {
        "Authorization": "Bearer change-me-now"
      }
    }
  }
}

HTTPS Reverse Proxy Example

Run the container on the server and terminate TLS in a reverse proxy. The proxy should forward /mcp to http://127.0.0.1:49317/mcp and preserve standard upgrade/streaming behavior.

Minimal Nginx-style location:

location /mcp {
    proxy_pass http://127.0.0.1:49317/mcp;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_buffering off;
}

OpenCode configuration for the HTTPS endpoint:

{
  "mcp": {
    "ddg-research": {
      "type": "remote",
      "enabled": true,
      "url": "https://your-domain.example/mcp",
      "oauth": false,
      "headers": {
        "Authorization": "Bearer change-me-now"
      }
    }
  }
}

Cloudflare Tunnel Example

Cloudflare Tunnel lets cloudflared make outbound-only connections from your server to Cloudflare, so you can publish the MCP HTTP endpoint without opening an inbound router/firewall port.

In the Cloudflare dashboard, create a tunnel and add a public hostname such as:

https://mcp.example.com

If cloudflared runs on the host, set the tunnel service URL to:

http://127.0.0.1:49317

If cloudflared runs as another service in the same compose project/network, set the tunnel service URL to the container service name and internal port:

http://ddg-research:8000

Minimal compose service example for token-managed tunnels:

cloudflared:
  image: cloudflare/cloudflared:latest
  restart: unless-stopped
  command: tunnel --no-autoupdate run --token ${CLOUDFLARE_TUNNEL_TOKEN}
  depends_on:
    - ddg-research

Keep CLOUDFLARE_TUNNEL_TOKEN outside version control. In OpenCode, use the public HTTPS URL and keep the MCP bearer token header:

{
  "mcp": {
    "ddg-research": {
      "type": "remote",
      "enabled": true,
      "url": "https://mcp.example.com/mcp",
      "oauth": false,
      "headers": {
        "Authorization": "Bearer change-me-now"
      }
    }
  }
}

For production, replace change-me-now with a long random token. Cloudflare Tunnel protects the network path, but the MCP server should still require its own bearer token.

Do not expose HTTP mode to an untrusted network without HTTPS and a strong MCP_AUTH_TOKEN. If MCP_AUTH_TOKEN is unset in HTTP mode, the server logs a warning and accepts unauthenticated HTTP requests.

For MCP stdio clients, direct docker run -i is usually simpler than compose because the client owns stdin/stdout.

HTTP Smoke Tests

Raw curl is useful for checking HTTP authentication and Host handling, but it does not perform a complete MCP streamable HTTP session. A request with the correct bearer token may therefore return 406 Not Acceptable because curl did not send the MCP client's expected Accept: text/event-stream negotiation headers. That still proves the request passed bearer-token auth and Host validation.

With the compose server running and the default compose token:

curl -i http://127.0.0.1:49317/mcp

Expected: 401 Unauthorized.

curl -i \
  -H "Host: YOUR_SERVER_IP:49317" \
  -H "Authorization: Bearer change-me-now" \
  http://127.0.0.1:49317/mcp

Expected: usually 406 Not Acceptable from raw curl, but not 401 Unauthorized and not 421 Misdirected Request.

With a real MCP client, such as OpenCode configured with the same URL and Authorization header, ListTools and CallTool should work for ddg_search, web_fetch, and ddg_deep_search.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| MCP_CACHE_DIR | /data/cache | Directory for JSON cache files. |
| DDG_CACHE_TTL_SECONDS | 21600 | Search cache TTL in seconds. |
| FETCH_CACHE_TTL_SECONDS | 7200 | Web fetch cache TTL in seconds. |
| DDG_TIMEOUT_SECONDS | 15 | DuckDuckGo provider and fallback timeout in seconds. |
| FETCH_TIMEOUT_SECONDS | 15 | Web fetch timeout in seconds. |
| MAX_CONCURRENCY | 5 | Default deep search page fetch concurrency limit when max_concurrency is omitted. Runtime caps this at 12. |
| MCP_TRANSPORT | stdio | MCP transport. stdio is the default. http uses streamable HTTP when supported by the installed SDK. |
| MCP_HOST | 0.0.0.0 | Host used for optional streamable HTTP mode. |
| MCP_PORT | 8000 | Port used for optional streamable HTTP mode. |
| MCP_AUTH_TOKEN | unset | Bearer token for HTTP mode. The included compose file sets this to change-me-now; replace it before real deployments. If unset, HTTP mode logs a warning and runs without auth. |
| MCP_ALLOWED_HOSTS | * | Comma-separated Host allowlist for HTTP mode. * supports arbitrary deployment hosts by disabling SDK Host/Origin rebinding checks. |
| MCP_ALLOWED_ORIGINS | * | Comma-separated Origin allowlist for HTTP mode. * supports arbitrary origins by disabling SDK Host/Origin rebinding checks. |

Cache Behavior

Search results are cached under the search cache namespace. Fetch responses are cached under the fetch cache namespace. Cache keys are SHA256 hashes of stable JSON payloads, so equivalent tool arguments map to the same file path.
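A stable cache key of this kind can be sketched as a SHA-256 digest over canonical JSON. The names cache_key and cache_path and the namespace/<hash>.json file layout are assumptions made for illustration; the README only states that keys are SHA256 hashes of stable JSON payloads.

```python
import hashlib
import json
from pathlib import Path


def cache_key(payload: dict) -> str:
    """Hash a payload so that key order and whitespace do not matter."""
    stable = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


def cache_path(cache_dir: str, namespace: str, payload: dict) -> Path:
    """Map a payload to a file path (assumed layout: <dir>/<namespace>/<key>.json)."""
    return Path(cache_dir) / namespace / f"{cache_key(payload)}.json"
```

Sorting keys before hashing is what makes {"query": "x", "max_results": 10} and {"max_results": 10, "query": "x"} resolve to the same file.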

Cache files are written atomically by writing a temporary file in the target cache directory and then renaming it into place. Corrupt, malformed, or expired cache files are ignored safely.

The default compose configuration persists cache files in /data/cache, with ~/docker/docker-data/mcp-ddg-research mounted into the container.

Rate Limit Notes

Defaults are intentionally conservative:

  • ddg_search defaults to 10 results and caps at 30.
  • ddg_deep_search defaults to 5 fetched pages and caps at 10.
  • Deep search concurrency defaults to 5.
  • Search and fetch results are cached to reduce repeated DuckDuckGo and website hits.

This project does not rotate proxies, bypass captchas, or attempt to evade rate limits. If DuckDuckGo blocks or rate limits requests, the tool returns structured errors instead of retrying aggressively.

SSRF and Security Protections

web_fetch only allows http and https URLs. It blocks known local or internal hostnames, including:

  • localhost
  • metadata
  • metadata.google.internal
  • hostnames ending in .local, .localhost, .internal, .lan, .intranet

It also rejects IP addresses in private, loopback, link-local, reserved, multicast, or unspecified ranges, including:

  • 0.0.0.0/8
  • 10.0.0.0/8
  • 127.0.0.0/8
  • 169.254.0.0/16
  • 172.16.0.0/12
  • 192.168.0.0/16
  • ::1/128
  • fc00::/7
  • fe80::/10

DNS is resolved before fetching. If any resolved address is unsafe, the request is rejected. Redirects are followed manually, and every redirect target is validated before the next request.
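The hostname and address checks can be approximated with the standard ipaddress and socket modules. This sketch is an assumption about how such checks could be written, not the server's code: the function names are hypothetical, and unwrapping IPv4-mapped IPv6 addresses is an extra precaution not mentioned above.

```python
import ipaddress
import socket

BLOCKED_HOSTNAMES = {"localhost", "metadata", "metadata.google.internal"}
BLOCKED_SUFFIXES = (".local", ".localhost", ".internal", ".lan", ".intranet")


def is_blocked_hostname(hostname: str) -> bool:
    host = hostname.strip().lower().rstrip(".")
    return host in BLOCKED_HOSTNAMES or host.endswith(BLOCKED_SUFFIXES)


def is_unsafe_ip(value: str) -> bool:
    """True for private, loopback, link-local, reserved, multicast, unspecified."""
    ip = ipaddress.ip_address(value)
    mapped = getattr(ip, "ipv4_mapped", None)  # e.g. ::ffff:127.0.0.1
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def host_is_safe(hostname: str, port: int = 443) -> bool:
    """Resolve the host and reject it if ANY resolved address is unsafe."""
    if is_blocked_hostname(hostname):
        return False
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_unsafe_ip(a) for a in addresses)
```

Checking every resolved address, rather than only the first, guards against hosts that publish both a public and an internal record.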

Unsupported schemes such as file://, ftp://, ssh://, gopher://, and data: are never fetched.
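Manual redirect handling with per-hop validation can be sketched as a loop. This is a hedged illustration: follow_redirects is a hypothetical name, send and is_safe are injected callables standing in for the HTTP client and the safety check, headers are assumed to use lowercase keys, and the limit of 5 redirects is an assumed value.

```python
from typing import Callable
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}

# send(url) performs ONE request without auto-redirects -> (status, headers)
SendFn = Callable[[str], tuple[int, dict]]


def follow_redirects(
    url: str, send: SendFn, is_safe: Callable[[str], bool], max_redirects: int = 5
) -> str:
    """Return the final URL, validating every hop before requesting it."""
    current = url
    for _ in range(max_redirects + 1):
        if urlsplit(current).scheme not in ("http", "https"):
            raise ValueError(f"unsupported scheme: {current}")
        if not is_safe(current):
            raise ValueError(f"unsafe target: {current}")
        status, headers = send(current)
        location = headers.get("location")
        if status not in REDIRECT_STATUSES or not location:
            return current  # not a redirect: this is the final URL
        current = urljoin(current, location)  # resolve relative Location values
    raise ValueError("too many redirects")
```

Because the check runs at the top of each iteration, a redirect to file:///etc/passwd or to a loopback address is rejected before any request is sent to it.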

Development Setup

Python 3.12 is required.

Create and activate a virtual environment:

python3.12 -m venv .venv
source .venv/bin/activate

Install the package with development tools:

python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

Run the MCP server locally:

python -m mcp_ddg_research.server

Test Commands

Run tests:

python -m pytest

Run lint:

python -m ruff check .

Build a wheel/sdist using the configured build backend:

python -m pip install build
python -m build

Release Automation

Releases are automated by .github/workflows/release.yml when commits or release tags are pushed. The workflow is Python-native:

  1. Install the project with development dependencies.
  2. Run Ruff, pytest, compile checks, Python package build, and a Docker build.
  3. On main branch pushes, use Python Semantic Release to create the next GitHub release from conventional commits.
  4. On v* tag pushes, treat the pushed tag as the release tag.
  5. If a release or release tag is present, build and push multi-architecture Docker images for linux/amd64 and linux/arm64.

The workflow publishes these image tags:

DOCKERHUB_USERNAME/mcp-ddg-research:latest
DOCKERHUB_USERNAME/mcp-ddg-research:vX.Y.Z
ghcr.io/isyuricunha/mcp-ddg-research:latest
ghcr.io/isyuricunha/mcp-ddg-research:vX.Y.Z

Required repository secrets:

| Secret | Purpose |
| --- | --- |
| DOCKERHUB_USERNAME | Docker Hub namespace for the published image. |
| DOCKERHUB_TOKEN | Docker Hub access token used by docker/login-action. |
| GITHUB_TOKEN | Provided automatically by GitHub Actions for GitHub releases and GHCR publishing. |

Use conventional commits to drive release versions:

  • fix: ... and perf: ... create patch releases.
  • feat: ... creates minor releases while the project is in 0.x.
  • Breaking changes are capped to a minor release while the project is in 0.x; after 1.0.0, they create major releases.
  • docs:, ci:, chore:, test:, style:, and refactor: do not create a release by default.

The release workflow updates pyproject.toml and src/mcp_ddg_research/__init__.py during semantic-release commits. It does not maintain a changelog file. It is intentionally skipped for documentation-only pushes and compose-file-only pushes.

Manual milestone releases are also supported. Create and push a vX.Y.Z tag that points at the intended release commit, and the tag workflow publishes the same Docker Hub and GHCR tags.

Limitations

  • DuckDuckGo HTML fallback does not support every option exposed by DuckDuckGo's full web interface.
  • time_filter is applied to the ddgs provider. The HTML fallback only sends the query and safe-search parameter.
  • PDF parsing is not implemented in v1.
  • Pages that depend on JavaScript are fetched as static HTML only, because there is no browser automation to execute scripts.
  • Some websites block automated HTTP clients or return incomplete content.
  • DNS safety checks reduce SSRF risk but cannot make arbitrary third-party fetching risk-free.

Optional Future Roadmap

These are optional future improvements, not current behavior:

  • Add configurable per-domain fetch throttling.
  • Add cache pruning utilities.
  • Add optional robots.txt awareness.
  • Add additional text extraction heuristics for common article layouts.
  • Add more integration tests around redirect chains and text content types.
