webmcp
MCP server for web search and content extraction using DuckDuckGo or SearXNG, with Playwright-based fetching and LLM-powered data extraction.
webmcp is an MCP server for web search and content extraction. LLM agents can use it to:
- search the web with DuckDuckGo (default) or SearXNG (optional)
- fetch and clean page content from one or more URLs
- send cleaned content to a local LLM for structured extraction
Features
- search_web(query, limit=10) returns web results (title, URL, description)
- extract(urls, prompt=None, schema=None, use_browser=True) extracts data from pages
- browser-based fetching with Playwright for JavaScript-heavy sites
- lightweight HTTP fetching mode for faster/simple pages
- persistent tool-call logging to tool_calls.log.json
- configurable search provider: DDG by default, optional SearXNG
Critical Requirement
For the main researcher llama.cpp server, include --webui-mcp-proxy in launch parameters. Without this flag, this workflow will not function correctly.
Prompting And Tested Setup
For best results, use research_prompt.txt as your system prompt. This prompt is a core part of the intended workflow and quality; it is effectively half of how this repository is meant to function.
Tested setup:
- Main researcher LLM: Qwen3.5:27b-Q3_K_M.gguf via llama.cpp on an RTX 4090, context length 200,000, about 40 tok/s.
- Extract tool LLM: Qwen3.5:9b-Q4_K_M.gguf via llama.cpp on a GTX 1080 Ti, context length 32,768, about 40 tok/s.
- This workflow has been tested with the llama.cpp WebUI specifically, and has not been validated with other MCP clients yet.
Requirements
- Python 3.10+
- A local OpenAI-compatible LLM endpoint (for example, llama.cpp, LM Studio, vLLM, or Ollama)
Configuration
The app reads LLM settings from environment variables and supports a local .env file.
- Copy .env.example to .env
- Set values:
LLM_URL=http://localhost:1234
LLM_MODEL=your-model-name
SEARCH_PROVIDER=ddg
# Optional when SEARCH_PROVIDER=searxng
SEARXNG_URL=http://localhost:8080
LLM_URL and LLM_MODEL are required at startup.
SEARCH_PROVIDER defaults to ddg. Set it to searxng to replace DDG, and provide SEARXNG_URL.
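The rules above can be sketched as a small validation step. This is a minimal illustration, not the code in app.py: the function name load_settings and the returned dictionary keys are hypothetical, and only the four variable names and the ddg default come from this README.

```python
import os


def load_settings(env=None):
    """Validate webmcp-style settings from a mapping (defaults to os.environ).

    Hypothetical helper for illustration; the real app may behave differently.
    """
    env = os.environ if env is None else env

    # LLM_URL and LLM_MODEL are required at startup.
    missing = [key for key in ("LLM_URL", "LLM_MODEL") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    # SEARCH_PROVIDER falls back to ddg when unset or empty.
    provider = (env.get("SEARCH_PROVIDER") or "ddg").strip().lower()
    if provider not in ("ddg", "searxng"):
        raise ValueError("Unknown SEARCH_PROVIDER: " + provider)

    # SEARXNG_URL only matters when the searxng provider is selected.
    searxng_url = env.get("SEARXNG_URL")
    if provider == "searxng" and not searxng_url:
        raise RuntimeError("SEARXNG_URL is required when SEARCH_PROVIDER=searxng")

    return {
        "llm_url": env["LLM_URL"],
        "llm_model": env["LLM_MODEL"],
        "search_provider": provider,
        "searxng_url": searxng_url if provider == "searxng" else None,
    }
```

Passing a plain dictionary instead of reading os.environ directly makes it easy to check a configuration before starting the server.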
Search Providers
search_web supports two providers:
- ddg (default): uses DuckDuckGo via ddgs
- searxng: uses your SearXNG instance
SearXNG notes:
- Set SEARCH_PROVIDER=searxng
- Set SEARXNG_URL to your instance base URL (for example, http://192.168.0.55:8888)
- webmcp calls <SEARXNG_URL>/search with format=json
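To check what such a request looks like, the URL can be built by hand and opened in a browser or with curl. The sketch below is an assumption about the request shape, not the code webmcp uses: build_searxng_url is a hypothetical helper, and q is SearXNG's usual query parameter. If the instance returns an error for format=json, the JSON output format may need to be enabled in the instance settings.

```python
from urllib.parse import urlencode


def build_searxng_url(base_url, query):
    """Build a <SEARXNG_URL>/search request URL that asks for JSON results.

    Hypothetical helper for illustration only.
    """
    # Tolerate a trailing slash on the configured base URL.
    params = urlencode({"q": query, "format": "json"})
    return base_url.rstrip("/") + "/search?" + params


print(build_searxng_url("http://192.168.0.55:8888", "local llm"))
```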
Install
Install dependencies from the pinned requirements file:
pip install -r requirements.txt
python -m playwright install chromium
Run
python app.py
Server starts on:
http://0.0.0.0:8642
MCP Usage Notes
- extract(..., use_browser=True) is best for dynamic pages that require JS rendering.
- extract(..., use_browser=False) is faster for static pages.
- If extraction quality is poor, the LLM should provide a more specific prompt and/or a stricter schema.
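As an illustration of a more specific prompt and a stricter schema, the sketch below assembles arguments for an extract call. The parameter names come from the tool signature above; everything else is an assumption, in particular that schema accepts a JSON Schema style dictionary, which this README does not confirm. The helper name and the example URL are hypothetical.

```python
import json


def build_extract_args(urls, prompt=None, schema=None, use_browser=True):
    """Assemble keyword arguments for an extract tool call (illustrative only)."""
    if isinstance(urls, str):
        urls = [urls]  # accept a single URL for convenience
    return {
        "urls": list(urls),
        "prompt": prompt,
        "schema": schema,
        "use_browser": use_browser,
    }


# Assumed schema format: a JSON Schema style object with named, required fields.
release_schema = {
    "type": "object",
    "properties": {
        "version": {"type": "string"},
        "release_date": {"type": "string"},
    },
    "required": ["version", "release_date"],
}

args = build_extract_args(
    "https://example.com/changelog",  # placeholder URL
    prompt="Return the latest released version and its release date.",
    schema=release_schema,
    use_browser=False,  # static page, so the lightweight fetch should be enough
)
print(json.dumps(args, indent=2))
```

Naming the exact fields and marking them as required leaves the extraction model less room to return loosely structured text.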
TODO
- Revisit JS page rendering and extraction strategy. Right now, roughly 25-30% of pages return little or no usable content even when fetched successfully.
- Improve anti-bot handling for page fetches. Many targets still return 400-range errors, so investigate stronger browser mimicry (Playwright/Chromium behavior, headers, fingerprinting, and potentially user-agent/profile rotation).
License
MIT. See LICENSE.