mcp-ddg-research
Lightweight MCP server for DuckDuckGo search with HTML fallback, safe webpage fetching, caching, and clean text extraction.
mcp-ddg-research is a self-hosted Python MCP server that exposes deterministic research primitives to MCP clients. It can run DuckDuckGo searches, fall back to DuckDuckGo's lightweight HTML endpoint when the ddgs provider fails, fetch webpages with SSRF protections, cache search/fetch responses, deduplicate URLs, and extract readable text from HTML pages.
The MCP client or agent is responsible for reasoning over the returned data. This server only returns structured search results and fetched page text.
What This Project Does
- Searches DuckDuckGo through ddgs.DDGS().text(...).
- Falls back to https://html.duckduckgo.com/html/ when ddgs fails, times out, rate limits, raises, or returns no results.
- Parses DuckDuckGo HTML fallback results with BeautifulSoup.
- Resolves DuckDuckGo redirect URLs such as /l/?uddg=....
- Deduplicates normalized result URLs.
- Fetches webpages with strict URL and DNS safety checks.
- Follows redirects manually and validates every redirect target.
- Extracts clean text from HTML by removing script, style, navigation, footer, and similar boilerplate.
- Caches search and fetch responses in a file-based JSON cache.
- Provides a simple deep search tool that searches once and fetches top result pages concurrently.
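As an illustration of the redirect-resolution and deduplication items above, here is a minimal sketch using only the standard library. The function names and the exact normalization rules (lowercased scheme and host, dropped fragment and trailing slash) are assumptions made for this example, not the project's actual implementation.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def resolve_ddg_redirect(href: str) -> str:
    """Return the target of a DuckDuckGo /l/?uddg=... link, else the href unchanged."""
    # Protocol-relative links (//duckduckgo.com/l/?...) need a scheme to parse cleanly.
    candidate = "https:" + href if href.startswith("//") else href
    parts = urlsplit(candidate)
    if parts.path.rstrip("/") == "/l":
        target = parse_qs(parts.query).get("uddg")
        if target:
            return target[0]  # parse_qs has already percent-decoded the value
    return href


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_results(results: list[dict]) -> list[dict]:
    """Keep the first result for each normalized URL, preserving the original order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in results:
        key = normalize_url(item["url"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Keeping the first occurrence means deduplication never reorders results, which matches the stated goal of preserving the provider's ranking.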
What This Project Does Not Do
- No LLM integration.
- No summarization.
- No report generation.
- No browser automation.
- No proxy rotation.
- No captcha bypassing.
- No ranking with model endpoints.
- No OpenAI, Anthropic, Ollama, LM Studio, or other model endpoint support.
Why HTML Fallback Exists
The ddgs package is the preferred provider because it offers a simple Python API and handles DuckDuckGo search details for normal use. Search providers can still fail because of network timeouts, temporary provider errors, rate limits, empty responses, dependency import problems, or upstream behavior changes.
When that happens, this server falls back to DuckDuckGo's lightweight HTML endpoint. The fallback uses conservative request defaults, browser-like headers, and BeautifulSoup selectors for .result, .result__a, and .result__snippet.
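The fallback decision described above can be sketched roughly as follows. This is a simplified illustration, not the server's code: the two provider callables, the function name, and the "ddg_html" provider label are hypothetical stand-ins.

```python
from collections.abc import Callable

# Hypothetical provider signature: (query, max_results) -> list of result dicts.
SearchFn = Callable[[str, int], list[dict]]


def search_with_fallback(
    query: str, max_results: int, primary: SearchFn, fallback: SearchFn
) -> dict:
    """Try the primary provider, then the fallback if it raises or returns nothing."""
    try:
        results = primary(query, max_results)
        if results:
            return {"provider": "ddgs", "results": results, "error": None}
        reason = "primary provider returned no results"
    except Exception as exc:  # timeouts, rate limits, import problems, and so on
        reason = f"{type(exc).__name__}: {exc}"

    try:
        # "ddg_html" is an assumed label for the HTML fallback provider.
        return {"provider": "ddg_html", "results": fallback(query, max_results), "error": None}
    except Exception as exc:
        # Return a structured error instead of retrying aggressively.
        return {
            "provider": "ddg_html",
            "results": [],
            "error": f"{reason}; fallback failed: {exc}",
        }
```

Treating an empty result list the same as an exception is what lets the fallback cover silent failures as well as hard errors.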
Available MCP Tools
ddg_search
Search DuckDuckGo and return structured results.
Arguments:
{
"query": "python mcp server fastmcp",
"max_results": 10,
"search_window": null,
"safe_search": "off",
"time_filter": "month",
"blocked_domains": [],
"allowed_domains": [],
"preferred_domains": []
}
Argument rules:
- query: string, required.
- max_results: integer, default 10, minimum 1, maximum 30.
- search_window: optional integer, minimum 1, maximum 100. If provided, this is the provider request size before dedupe/domain controls/final cap.
- safe_search: one of off, moderate, strict, default off.
- time_filter: optional, one of day, week, month, year.
- blocked_domains: optional list of domains to remove from results, default [].
- allowed_domains: optional list of domains to keep, default [].
- preferred_domains: optional list of domains to move earlier while preserving stable order, default [].
Response example:
{
"query": "python mcp server fastmcp",
"provider": "ddgs",
"results": [
{
"title": "MCP Python SDK",
"url": "https://github.com/modelcontextprotocol/python-sdk",
"snippet": "Python SDK for Model Context Protocol servers and clients."
}
],
"cached": false,
"error": null
}
web_fetch
Fetch a single webpage and return clean text.
Arguments:
{
"url": "https://example.com/article",
"max_chars": 12000
}
Argument rules:
- url: HTTP or HTTPS URL.
- max_chars: integer, default 12000, minimum 1000, maximum 50000.
Response example:
{
"url": "https://example.com/article",
"final_url": "https://example.com/article",
"title": "Example Article",
"content": "Readable extracted page text...",
"content_type": "text/html; charset=utf-8",
"cached": false,
"success": true,
"error": null
}
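To give a feel for how the content field of web_fetch might be produced, here is a small text-extraction sketch. It uses the standard library's html.parser so that it runs without dependencies; it is not the project's extraction code, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers whose text is dropped entirely.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring anything nested inside skipped tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped container.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str, max_chars: int = 12000) -> str:
    """Return readable text from an HTML document, truncated to max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:max_chars]
```

The max_chars truncation mirrors the tool argument of the same name; everything else about the heuristics is illustrative.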
ddg_deep_search
Search once, fetch top result pages concurrently, and return sources plus page content.
Arguments:
{
"query": "model context protocol python sdk",
"max_results": 10,
"search_window": null,
"max_pages": 5,
"max_chars_per_page": 12000,
"safe_search": "off",
"time_filter": "year",
"blocked_domains": [],
"allowed_domains": [],
"preferred_domains": [],
"max_concurrency": null
}
Argument rules:
- query: string, required.
- max_results: integer, default 10, minimum 1, maximum 30.
- search_window: optional integer, minimum 1, maximum 100. Passed through to ddg_search as the provider request size before final result capping.
- max_pages: integer, default 5, minimum 1, maximum 10.
- max_chars_per_page: integer, default 12000, minimum 1000, maximum 50000.
- safe_search: one of off, moderate, strict, default off.
- time_filter: optional, one of day, week, month, year.
- blocked_domains: optional list of domains to remove from search results before fetching, default [].
- allowed_domains: optional list of domains to keep before fetching, default [].
- preferred_domains: optional list of domains to move earlier before fetching, default [].
- max_concurrency: optional per-call page fetch concurrency, minimum 1, maximum 12. If omitted, MAX_CONCURRENCY is used.
Response example:
{
"query": "model context protocol python sdk",
"search_provider": "ddgs",
"sources": [
{
"title": "MCP Python SDK",
"url": "https://github.com/modelcontextprotocol/python-sdk",
"snippet": "Python SDK for Model Context Protocol servers and clients."
}
],
"pages": [
{
"title": "MCP Python SDK",
"url": "https://github.com/modelcontextprotocol/python-sdk",
"final_url": "https://github.com/modelcontextprotocol/python-sdk",
"content": "Extracted page text..."
}
],
"failed_pages": [],
"cached": false
}
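The bounded concurrent fetching that ddg_deep_search performs can be pictured with a semaphore-based sketch like the one below. The fetch_one coroutine, the function name, and the shape of the failed-page entries are hypothetical; the source only shows failed_pages as an empty list.

```python
import asyncio


async def fetch_pages(urls, fetch_one, max_concurrency: int = 5):
    """Fetch all URLs with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await fetch_one(url), None
            except Exception as exc:
                # A single failing page should not abort the whole batch.
                return url, None, str(exc)

    # gather() returns outcomes in the same order as the input URLs.
    outcomes = await asyncio.gather(*(guarded(url) for url in urls))
    pages = [page for _, page, err in outcomes if err is None]
    failed = [{"url": url, "error": err} for url, _, err in outcomes if err is not None]
    return pages, failed
```

A caller would run this with asyncio.run(fetch_pages(urls, fetch_one, max_concurrency=2)) to get the same effect as passing max_concurrency: 2 for one call.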
Domain Controls
Domain controls are opt-in. If you do not pass blocked_domains,
allowed_domains, preferred_domains, or search_window, ddg_search
requests exactly max_results from DuckDuckGo and preserves DuckDuckGo's
default ranking order after URL deduplication. The server does not apply a
built-in source bias, source boost, or domain blocklist.
When any domain control is provided, the server requests a larger internal window from the provider before applying dedupe and domain controls. The default window is:
min(max_results * 3, 50)
The final response is still capped to max_results. You can override the
provider request size with search_window, minimum 1, maximum 100. This is
useful when a desired allowed/preferred domain might appear outside the first
max_results provider results.
Domain inputs are normalized by lowercasing, removing URL schemes, removing
paths and query strings, and stripping a leading www.. Matching supports exact
domains and subdomains. For example, docs.example.com matches example.com,
but example.com.evil.com does not.
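A minimal sketch of that normalization and matching, assuming inputs without ports or credentials (the function names are illustrative):

```python
def normalize_domain(value: str) -> str:
    """Lowercase, strip scheme, path, query string, and a leading www."""
    domain = value.strip().lower()
    if "://" in domain:
        domain = domain.split("://", 1)[1]
    for separator in "/?#":
        domain = domain.split(separator, 1)[0]
    if domain.startswith("www."):
        domain = domain[4:]
    return domain


def domain_matches(host: str, rule: str) -> bool:
    """True when host equals rule or is a subdomain of it."""
    host, rule = normalize_domain(host), normalize_domain(rule)
    # Requiring the leading dot stops example.com.evil.com from matching example.com.
    return host == rule or host.endswith("." + rule)
```

Matching on a dot-prefixed suffix rather than a plain substring is what keeps lookalike hosts out.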
Filtering order:
- Apply allowed_domains if provided.
- Apply blocked_domains if provided.
- Apply preferred_domains if provided.
preferred_domains performs a stable partition: preferred matches move earlier,
relative order is preserved inside the preferred and non-preferred groups, and
no numeric score is invented.
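The filtering order and the stable partition can be sketched together as below. This assumes the domain lists are already normalized and that each result is a dict with a url key; the helper names are hypothetical.

```python
from collections.abc import Sequence
from urllib.parse import urlsplit


def _matches(url: str, domains: Sequence[str]) -> bool:
    """True when the URL's host equals or is a subdomain of any listed domain."""
    host = (urlsplit(url).hostname or "").removeprefix("www.")
    return any(host == d or host.endswith("." + d) for d in domains)


def apply_domain_controls(
    results: list[dict],
    allowed: Sequence[str] = (),
    blocked: Sequence[str] = (),
    preferred: Sequence[str] = (),
    max_results: int = 10,
) -> list[dict]:
    """Apply allow, block, then prefer, and cap the final list."""
    if allowed:
        results = [r for r in results if _matches(r["url"], allowed)]
    if blocked:
        results = [r for r in results if not _matches(r["url"], blocked)]
    if preferred:
        # sorted() is stable: False (preferred) sorts before True, and each
        # group keeps its original relative order.
        results = sorted(results, key=lambda r: not _matches(r["url"], preferred))
    return results[:max_results]
```

Sorting on a boolean key is one simple way to get a stable partition without inventing a score.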
Block domains:
{
"query": "self hosted photo backup",
"blocked_domains": ["example.com", "old-docs.example.org"]
}
Allow only specific domains:
{
"query": "python mcp server",
"allowed_domains": ["github.com", "modelcontextprotocol.io"]
}
Prefer domains without excluding others:
{
"query": "duckduckgo html search endpoint",
"preferred_domains": ["duckduckgo.com", "github.com"]
}
Search a larger internal window before applying domain controls:
{
"query": "python mcp server",
"max_results": 10,
"search_window": 40,
"allowed_domains": ["github.com", "modelcontextprotocol.io"]
}
Deep search with the same search window behavior:
{
"query": "model context protocol python sdk",
"max_results": 10,
"max_pages": 5,
"search_window": 40,
"preferred_domains": ["github.com"]
}
Limit deep-search fetch concurrency for one call:
{
"query": "model context protocol python sdk",
"max_pages": 5,
"max_concurrency": 2
}
Docker Stdio Usage
Build the local image:
docker build -t mcp-ddg-research:local .
Run the server over stdio. This mode is auth-free because the MCP client owns stdin/stdout and there is no listening network socket:
docker run --rm -i -v "$PWD/data:/data" mcp-ddg-research:local
Docker Stdio MCP Client Configuration
{
"mcpServers": {
"ddg-research": {
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"-v",
"/opt/mcp-ddg-research/data:/data",
"mcp-ddg-research:local"
]
}
}
}
docker-compose Usage
The included compose file starts the server in streamable HTTP mode on /mcp.
It maps host port 49317 to container port 8000 and requires
Authorization: Bearer change-me-now by default.
Build and start the service:
docker compose up --build ddg-research
The compose file persists cache data at:
~/docker/docker-data/mcp-ddg-research/cache
The checked-in compose token is the placeholder change-me-now. It is
acceptable for local smoke tests only. Replace MCP_AUTH_TOKEN in
docker-compose.yml before using LAN, VPN, reverse-proxy, or Cloudflare Tunnel
deployments.
The compose file defaults MCP_ALLOWED_HOSTS=* and MCP_ALLOWED_ORIGINS=* so
the same container can run behind a LAN IP, hostname, domain, reverse proxy, or
HTTPS endpoint. In MCP SDK 1.27.2, wildcard Host/Origin validation is not
supported by the DNS rebinding middleware, so wildcard mode disables the SDK
Host/Origin allowlist and relies on the bearer token. To enable strict
Host/Origin checks, set exact comma-separated values such as:
MCP_ALLOWED_HOSTS="example.com,example.com:443,localhost:49317"
MCP_ALLOWED_ORIGINS="https://example.com,http://localhost:*"
LAN HTTP Example
Set a real token in docker-compose.yml and start the server:
docker compose up -d --build
Use your server's LAN IP in the client URL:
http://YOUR_SERVER_IP:49317/mcp
OpenCode remote MCP configuration for a LAN deployment:
{
"mcp": {
"ddg-research": {
"type": "remote",
"enabled": true,
"url": "http://YOUR_SERVER_IP:49317/mcp",
"oauth": false,
"headers": {
"Authorization": "Bearer change-me-now"
}
}
}
}
HTTPS Reverse Proxy Example
Run the container on the server and terminate TLS in a reverse proxy. The proxy
should forward /mcp to http://127.0.0.1:49317/mcp and preserve standard
upgrade/streaming behavior.
Minimal Nginx-style location:
location /mcp {
    proxy_pass http://127.0.0.1:49317/mcp;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_buffering off;
}
OpenCode configuration for the HTTPS endpoint:
{
"mcp": {
"ddg-research": {
"type": "remote",
"enabled": true,
"url": "https://your-domain.example/mcp",
"oauth": false,
"headers": {
"Authorization": "Bearer change-me-now"
}
}
}
}
Cloudflare Tunnel Example
Cloudflare Tunnel lets cloudflared make outbound-only connections from your
server to Cloudflare, so you can publish the MCP HTTP endpoint without opening
an inbound router/firewall port.
In the Cloudflare dashboard, create a tunnel and add a public hostname such as:
https://mcp.example.com
If cloudflared runs on the host, set the tunnel service URL to:
http://127.0.0.1:49317
If cloudflared runs as another service in the same compose project/network,
set the tunnel service URL to the container service name and internal port:
http://ddg-research:8000
Minimal compose service example for token-managed tunnels:
cloudflared:
  image: cloudflare/cloudflared:latest
  restart: unless-stopped
  command: tunnel --no-autoupdate run --token ${CLOUDFLARE_TUNNEL_TOKEN}
  depends_on:
    - ddg-research
Keep CLOUDFLARE_TUNNEL_TOKEN outside version control. In OpenCode, use the
public HTTPS URL and keep the MCP bearer token header:
{
"mcp": {
"ddg-research": {
"type": "remote",
"enabled": true,
"url": "https://mcp.example.com/mcp",
"oauth": false,
"headers": {
"Authorization": "Bearer change-me-now"
}
}
}
}
For production, replace change-me-now with a long random token. Cloudflare
Tunnel protects the network path, but the MCP server should still require its
own bearer token.
Do not expose HTTP mode to an untrusted network without HTTPS and a strong
MCP_AUTH_TOKEN. If MCP_AUTH_TOKEN is unset in HTTP mode, the server logs a
warning and accepts unauthenticated HTTP requests.
For MCP stdio clients, direct docker run -i is usually simpler than compose because the client owns stdin/stdout.
HTTP Smoke Tests
Raw curl is useful for checking HTTP authentication and Host handling, but it
does not perform a complete MCP streamable HTTP session. A request with the
correct bearer token may therefore return 406 Not Acceptable because curl did
not send the MCP client's expected Accept: text/event-stream negotiation
headers. That still proves the request passed bearer-token auth and Host
validation.
With the compose server running and the default compose token:
curl -i http://127.0.0.1:49317/mcp
Expected: 401 Unauthorized.
curl -i \
  -H "Host: YOUR_SERVER_IP:49317" \
  -H "Authorization: Bearer change-me-now" \
  http://127.0.0.1:49317/mcp
Expected: usually 406 Not Acceptable from raw curl, but not 401 Unauthorized
and not 421 Misdirected Request.
With a real MCP client, such as OpenCode configured with the same URL and
Authorization header, ListTools and CallTool should work for ddg_search,
web_fetch, and ddg_deep_search.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MCP_CACHE_DIR | /data/cache | Directory for JSON cache files. |
| DDG_CACHE_TTL_SECONDS | 21600 | Search cache TTL in seconds. |
| FETCH_CACHE_TTL_SECONDS | 7200 | Web fetch cache TTL in seconds. |
| DDG_TIMEOUT_SECONDS | 15 | DuckDuckGo provider and fallback timeout in seconds. |
| FETCH_TIMEOUT_SECONDS | 15 | Web fetch timeout in seconds. |
| MAX_CONCURRENCY | 5 | Default deep search page fetch concurrency limit when max_concurrency is omitted. Runtime caps this at 12. |
| MCP_TRANSPORT | stdio | MCP transport. stdio is the default. http uses streamable HTTP when supported by the installed SDK. |
| MCP_HOST | 0.0.0.0 | Host used for optional streamable HTTP mode. |
| MCP_PORT | 8000 | Port used for optional streamable HTTP mode. |
| MCP_AUTH_TOKEN | unset | Bearer token for HTTP mode. The included compose file sets this to change-me-now; replace it before real deployments. If unset, HTTP mode logs a warning and runs without auth. |
| MCP_ALLOWED_HOSTS | * | Comma-separated Host allowlist for HTTP mode. * supports arbitrary deployment hosts by disabling SDK Host/Origin rebinding checks. |
| MCP_ALLOWED_ORIGINS | * | Comma-separated Origin allowlist for HTTP mode. * supports arbitrary origins by disabling SDK Host/Origin rebinding checks. |
Cache Behavior
Search results are cached under the search cache namespace. Fetch responses are cached under the fetch cache namespace. Cache keys are SHA256 hashes of stable JSON payloads, so equivalent tool arguments map to the same file path.
Cache files are written atomically by writing a temporary file in the target cache directory and then renaming it into place. Corrupt, malformed, or expired cache files are ignored safely.
The default compose configuration persists cache files in /data/cache, with
~/docker/docker-data/mcp-ddg-research mounted into the container.
Rate Limit Notes
Defaults are intentionally conservative:
- ddg_search defaults to 10 results and caps at 30.
- ddg_deep_search defaults to 5 fetched pages and caps at 10.
- Deep search concurrency defaults to 5.
- Search and fetch results are cached to reduce repeated DuckDuckGo and website hits.
This project does not rotate proxies, bypass captchas, or attempt to evade rate limits. If DuckDuckGo blocks or rate limits requests, the tool returns structured errors instead of retrying aggressively.
SSRF and Security Protections
web_fetch only allows http and https URLs. It blocks known local or internal hostnames, including:
- localhost
- metadata
- metadata.google.internal
- hostnames ending in .local, .localhost, .internal, .lan, .intranet
It also rejects IP addresses in private, loopback, link-local, reserved, multicast, or unspecified ranges, including:
- 0.0.0.0/8
- 10.0.0.0/8
- 127.0.0.0/8
- 169.254.0.0/16
- 172.16.0.0/12
- 192.168.0.0/16
- ::1/128
- fc00::/7
- fe80::/10
DNS is resolved before fetching. If any resolved address is unsafe, the request is rejected. Redirects are followed manually, and every redirect target is validated before the next request.
Unsupported schemes such as file://, ftp://, ssh://, gopher://, and data: are never fetched.
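The hostname and address checks in this section can be approximated with the standard ipaddress and socket modules. This is a simplified sketch, not the server's implementation: the function names are hypothetical, and a production check would also need to pin the validated address for the actual connection and repeat the validation for each redirect target.

```python
import ipaddress
import socket

BLOCKED_HOSTS = {"localhost", "metadata", "metadata.google.internal"}
BLOCKED_SUFFIXES = (".local", ".localhost", ".internal", ".lan", ".intranet")


def is_blocked_hostname(hostname: str) -> bool:
    """True for known local or internal names and suffixes."""
    host = hostname.strip().rstrip(".").lower()
    return host in BLOCKED_HOSTS or host.endswith(BLOCKED_SUFFIXES)


def is_unsafe_ip(address: str) -> bool:
    """True for private, loopback, link-local, reserved, multicast, or unspecified IPs."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolves_safely(hostname: str, port: int = 443) -> bool:
    """Resolve the host and require every returned address to be safe."""
    if is_blocked_hostname(hostname):
        return False
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_unsafe_ip(addr) for addr in addresses)
```

Rejecting the request when any resolved address is unsafe, rather than only the first, follows the rule stated above.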
Development Setup
Python 3.12 is required.
Create and activate a virtual environment:
python3.12 -m venv .venv
source .venv/bin/activate
Install the package with development tools:
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
Run the MCP server locally:
python -m mcp_ddg_research.server
Test Commands
Run tests:
python -m pytest
Run lint:
python -m ruff check .
Build a wheel/sdist using the configured build backend:
python -m pip install build
python -m build
Release Automation
Releases are automated by .github/workflows/release.yml when commits or
release tags are pushed. The workflow is Python-native:
- Install the project with development dependencies.
- Run Ruff, pytest, compile checks, Python package build, and a Docker build.
- On main branch pushes, use Python Semantic Release to create the next GitHub release from conventional commits.
- On v* tag pushes, treat the pushed tag as the release tag.
- If a release or release tag is present, build and push multi-architecture Docker images for linux/amd64 and linux/arm64.
The workflow publishes these image tags:
DOCKERHUB_USERNAME/mcp-ddg-research:latest
DOCKERHUB_USERNAME/mcp-ddg-research:vX.Y.Z
ghcr.io/isyuricunha/mcp-ddg-research:latest
ghcr.io/isyuricunha/mcp-ddg-research:vX.Y.Z
Required repository secrets:
| Secret | Purpose |
|---|---|
| DOCKERHUB_USERNAME | Docker Hub namespace for the published image. |
| DOCKERHUB_TOKEN | Docker Hub access token used by docker/login-action. |
| GITHUB_TOKEN | Provided automatically by GitHub Actions for GitHub releases and GHCR publishing. |
Use conventional commits to drive release versions:
- fix: ... and perf: ... create patch releases.
- feat: ... creates minor releases while the project is in 0.x.
- Breaking changes are capped to a minor release while the project is in 0.x; after 1.0.0, they create major releases.
- docs:, ci:, chore:, test:, style:, and refactor: do not create a release by default.
The release workflow updates pyproject.toml and
src/mcp_ddg_research/__init__.py during semantic-release commits. It does not
maintain a changelog file. It is intentionally skipped for documentation-only
pushes and compose-file-only pushes.
Manual milestone releases are also supported. Create and push a vX.Y.Z tag
that points at the intended release commit, and the tag workflow publishes the
same Docker Hub and GHCR tags.
Limitations
- DuckDuckGo HTML fallback does not support every option exposed by DuckDuckGo's full web interface.
- time_filter is applied to the ddgs provider. The HTML fallback only sends the query and safe-search parameter.
- PDF parsing is not implemented in v1.
- JavaScript-rendered pages are not rendered because there is no browser automation.
- Some websites block automated HTTP clients or return incomplete content.
- DNS safety checks reduce SSRF risk but cannot make arbitrary third-party fetching risk-free.
Optional Future Roadmap
These are optional future improvements, not current behavior:
- Add configurable per-domain fetch throttling.
- Add cache pruning utilities.
- Add optional robots.txt awareness.
- Add additional text extraction heuristics for common article layouts.
- Add more integration tests around redirect chains and text content types.