MCP Web Scraper
A Model Context Protocol server that scrapes static HTML with BeautifulSoup for analysis by MCP clients.
A standalone Model Context Protocol (MCP) server that scrapes static HTML with BeautifulSoup. LLM analysis is handled by your MCP client (e.g. Odysseus with Ollama, or Open WebUI) — this server only provides scraping tools.
Designed for container deployment over Streamable HTTP.
What it does
Exposes two MCP tools:
| Tool | Description |
|---|---|
| scrape_url | Fetch a URL and return title, text, optional CSS matches, and links |
| extract_from_html | Parse existing HTML with a CSS selector |
Limitations: static HTML only (no JavaScript rendering), no anti-bot bypass. Respect site terms and rate limits.
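To make the shape of a scrape_url result more concrete, the sketch below pulls a title, visible text, and absolute links out of a static HTML string. It is an illustration only: it uses Python's standard-library html.parser as a stand-in for BeautifulSoup so that it runs without dependencies, and the names PageSummary and summarize, as well as the result keys, are hypothetical rather than this server's actual API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageSummary(HTMLParser):
    """Collect the title, visible text, and links from static HTML.

    Stand-in for BeautifulSoup; not the server's real implementation.
    """

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        if self._in_title:
            self.title += stripped
        elif not self._skip_depth:
            self.text_parts.append(stripped)


def summarize(html: str, base_url: str) -> dict:
    """Return a dict with hypothetical keys: title, text, links."""
    parser = PageSummary(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "text": " ".join(parser.text_parts),
        "links": parser.links,
    }
```

Because nothing here executes JavaScript, content that a page injects at runtime never reaches the parser, which is the static-HTML limitation noted above.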
Prerequisites
- Docker
Quick start (GHCR image)
After your first GitHub Release, pull the published image:
docker pull ghcr.io/<owner>/mcp-web-scraper:latest
docker run -d \
--name mcp-web-scraper \
-p 127.0.0.1:8000:8000 \
ghcr.io/<owner>/mcp-web-scraper:latest
Verify health:
curl http://127.0.0.1:8000/health
MCP endpoint: http://127.0.0.1:8000/mcp
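The same health check can be scripted, for example in a deployment smoke test. The sketch below assumes the /health endpoint answers with the JSON body {"status":"ok"} shown elsewhere in this README; the function names is_healthy and check_health are hypothetical.

```python
import json
from urllib.request import urlopen


def is_healthy(body: bytes) -> bool:
    """Return True when the body is a JSON object with status == "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON at all, or JSON that is not an object.
        return False


def check_health(base_url: str = "http://127.0.0.1:8000", timeout: float = 5.0) -> bool:
    """GET <base_url>/health and report whether the server looks healthy."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:
        # Covers connection refused, DNS failures, timeouts, and HTTP errors.
        return False
```

With the container from the quick start running, check_health() should return True; it returns False rather than raising when the server is unreachable.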
Make the GHCR package public
- Go to your GitHub profile → Packages → mcp-web-scraper
- Package settings → Change visibility → Public
Local build
cp .env.example .env
docker compose up --build -d
Connect to Odysseus
Odysseus uses native Streamable HTTP MCP (transport http) via the Python SDK's streamablehttp_client. No MCPO bridge required.
Odysseus requirements this server meets
| Requirement | How this server satisfies it |
|---|---|
| Transport http | Streamable HTTP at /mcp with stateless_http=True and json_response=True |
| MCP URL | Use http://<host>:8000/mcp exactly (Odysseus McpManager._connect_http) |
| No OAuth | Server does not return 401; Odysseus only starts OAuth on 401 responses |
| Tool discovery | initialize + list_tools expose scrape_url and extract_from_html with JSON Schema |
| Tool calls | Agent invokes mcp__{server_id}__scrape_url via session.call_tool |
| Fast connect | Startup completes within Odysseus's 8s HTTP connect window |
Important: Leave MCP_API_KEY empty when using Odysseus. Odysseus does not send a Bearer token for HTTP MCP servers, and a 401 response would trigger its OAuth flow.
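For a rough picture of what travels over /mcp, the sketch below builds the JSON-RPC 2.0 messages behind tool discovery and a tool call and wraps them in HTTP POST requests. The method names tools/list and tools/call and the Accept header come from the MCP specification; the argument key url for scrape_url is an assumption, and the helper names jsonrpc and as_http_request are hypothetical. Whether the server accepts these calls without a prior initialize exchange depends on its stateless configuration and is not verified here.

```python
import json
from typing import Optional
from urllib.request import Request

MCP_URL = "http://127.0.0.1:8000/mcp"  # endpoint from this README


def jsonrpc(method: str, params: Optional[dict] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request object as used by MCP."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


def as_http_request(message: dict, url: str = MCP_URL) -> Request:
    """Wrap a JSON-RPC message in a POST suitable for Streamable HTTP."""
    return Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP clients advertise both response types.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


list_tools = as_http_request(jsonrpc("tools/list"))
call_scrape = as_http_request(
    jsonrpc(
        "tools/call",
        # "url" as the argument name is an assumption, not taken from the schema.
        {"name": "scrape_url", "arguments": {"url": "https://example.com"}},
        request_id=2,
    )
)
# To send: urllib.request.urlopen(call_scrape).read() against a running server.
```

In practice an MCP client such as Odysseus performs these exchanges through the Python SDK, so a hand-built request like this is mainly useful for debugging with a plain HTTP tool.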
- Run this MCP server (publish port 8000 or attach to the same Docker network as Odysseus).
- In Odysseus Settings → MCP (admin), add a server:
| Field | Value |
|---|---|
| Name | web-scraper |
| Transport | http |
| URL | http://mcp-web-scraper:8000/mcp (shared compose network) or http://host.docker.internal:8000/mcp (MCP on host) |
- Configure your LLM in Odysseus separately — the agent uses scraping tools from this server and its own model for analysis.
Compose overlay with Odysseus
Add to your Odysseus docker-compose.yml or an override file:
services:
  mcp-web-scraper:
    image: ghcr.io/<owner>/mcp-web-scraper:latest
    ports:
      - "127.0.0.1:8000:8000"
    restart: unless-stopped
  odysseus:
    depends_on:
      - mcp-web-scraper
Use http://mcp-web-scraper:8000/mcp as the MCP URL inside Odysseus.
Shared network overlay (separate compose projects)
If Odysseus and this server run in different docker compose projects, attach this service to the Odysseus network:
# Confirm the Odysseus network name (usually odysseus_default)
docker network ls | rg odysseus
docker compose -f docker-compose.yml -f docker-compose.odysseus.yml up -d
Register URL: http://mcp-web-scraper:8000/mcp
Odysseus in Docker (most common)
Odysseus's backend resolves the MCP URL inside its container. The hostname mcp-web-scraper only works when both containers share a Docker network; otherwise you get Temporary failure in name resolution and Odysseus may return HTTP 500.
Quickest fix — MCP published on the host (MCP_BIND=0.0.0.0, the default):
| Field | Value |
|---|---|
| Transport | http (Streamable HTTP) |
| URL | http://host.docker.internal:8000/mcp |
Odysseus's docker-compose.yml already sets extra_hosts: host.docker.internal:host-gateway. Verify from the Odysseus container:
docker exec -it odysseus-odysseus-1 curl -sf http://host.docker.internal:8000/health
Shared-network fix — use the service name instead of the host:
docker network ls | rg odysseus # e.g. odysseus_default
docker compose -f docker-compose.yml -f docker-compose.odysseus.yml up -d
Then register http://mcp-web-scraper:8000/mcp.
Which MCP URL to use
| Odysseus runs… | MCP runs… | URL in Odysseus |
|---|---|---|
| In Docker (same compose/network) | In Docker (same network) | http://mcp-web-scraper:8000/mcp |
| In Docker (separate compose) | In Docker | docker-compose.odysseus.yml overlay, then http://mcp-web-scraper:8000/mcp |
| In Docker | On Docker host (MCP_BIND=0.0.0.0) | http://host.docker.internal:8000/mcp |
| On host | In Docker (127.0.0.1:8000 publish) | http://127.0.0.1:8000/mcp |
| On host | In Docker (0.0.0.0:8000 publish) | http://127.0.0.1:8000/mcp or your host LAN IP |
Do not use your host LAN IP unless MCP_BIND publishes port 8000 on 0.0.0.0. The default compose file uses 0.0.0.0; set MCP_BIND=127.0.0.1 only if you want host-local access.
Troubleshooting Odysseus POST /api/mcp/servers → 500
The browser error is reported by Odysseus (localhost:7000), not this MCP server. A failed MCP connection normally returns HTTP 200 with "status": "error" in the JSON body — a 500 means Odysseus raised an unhandled exception.
- Read Odysseus logs while saving the server (replace the container name if different):
  docker logs -f odysseus-odysseus-1 2>&1 | rg -i "mcp|error|traceback"
  Or, if Odysseus runs directly on the host, check the terminal where app.py is running.
- Test reachability from Odysseus (run inside the Odysseus container):
  docker exec -it odysseus-odysseus-1 curl -sf http://mcp-web-scraper:8000/health
  # or, MCP on host:
  docker exec -it odysseus-odysseus-1 curl -sf http://host.docker.internal:8000/health
  Expect {"status":"ok"}. If this fails, fix networking before re-saving in the UI.
- Confirm this server is up on the host:
  curl -sf http://127.0.0.1:8000/health
- Leave MCP_API_KEY empty for Odysseus (see .env.example). Odysseus does not send Bearer tokens.
- Use the exact path /mcp, not / or /sse.
- Pull or rebuild the latest image after compatibility fixes:
  docker compose pull && docker compose up -d --build
Common log messages:
| Odysseus / client error | Fix |
|---|---|
| Name or service not known | Shared Docker network missing; use docker-compose.odysseus.yml |
| Connection refused | Wrong URL, or MCP published only on 127.0.0.1 while Odysseus is in another container |
| no such column: oauth_tokens | Run Odysseus once so DB migrations apply, or upgrade Odysseus |
| 403 Forbidden on /mcp | Unset MCP_API_KEY in this server's .env |
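The two networking failures in the table can be told apart with a small TCP probe run from wherever the client lives. This is a minimal sketch using only the standard library; the function names classify and probe and the returned labels are hypothetical, and the hint texts simply restate the fixes listed above.

```python
import socket

HINTS = {
    "ok": "TCP connection succeeded",
    "dns": "Name resolution failed: share a Docker network or use host.docker.internal",
    "refused": "Connection refused: check the URL and whether the port is published only on 127.0.0.1",
    "timeout": "Timed out: check firewalls and the published address",
    "other": "Unclassified network error",
}


def classify(exc: BaseException) -> str:
    """Map a socket-level exception to one of the labels in HINTS."""
    if isinstance(exc, socket.gaierror):
        return "dns"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    return "other"


def probe(host: str, port: int = 8000, timeout: float = 3.0) -> str:
    """Try a TCP connection and return a label describing the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify(exc)
```

For example, HINTS[probe("mcp-web-scraper")] run inside the Odysseus container indicates whether the service name resolves on that network.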
Connect to Open WebUI
Open WebUI v0.6.31+ supports MCP Streamable HTTP natively.
- Admin Settings → External Tools → Add Server
- Type: MCP (Streamable HTTP)
- URL: http://host.docker.internal:8000/mcp
- Auth: None (or Bearer if MCP_API_KEY is set)
- Enable Function Calling: Native on your model
Configuration
Copy .env.example to .env:
| Variable | Default | Description |
|---|---|---|
| MCP_HOST | 0.0.0.0 | Bind address |
| MCP_PORT | 8000 | HTTP port |
| SCRAPE_TIMEOUT_S | 30 | HTTP fetch timeout (seconds) |
| SCRAPE_MAX_BYTES | 2097152 | Max download size (2 MB) |
| MCP_API_KEY | (empty) | Optional Bearer token for /mcp (incompatible with Odysseus) |
| ALLOWED_ORIGINS | * | Origin allowlist; * disables DNS rebinding checks (recommended for Odysseus/Docker) |
| USER_AGENT | mcp-web-scraper/0.1.0 | HTTP User-Agent header |
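As an illustration of how these variables and defaults fit together, the sketch below reads them from an environment mapping into a typed settings object. It is not the server's actual configuration code: the Settings class and its field names are hypothetical, and treating ALLOWED_ORIGINS as a comma-separated list is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    scrape_timeout_s: float
    scrape_max_bytes: int
    api_key: str
    allowed_origins: Tuple[str, ...]
    user_agent: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from env, falling back to the documented defaults."""
        origins = env.get("ALLOWED_ORIGINS", "*")
        return cls(
            host=env.get("MCP_HOST", "0.0.0.0"),
            port=int(env.get("MCP_PORT", "8000")),
            scrape_timeout_s=float(env.get("SCRAPE_TIMEOUT_S", "30")),
            scrape_max_bytes=int(env.get("SCRAPE_MAX_BYTES", "2097152")),
            api_key=env.get("MCP_API_KEY", ""),
            # Assumption: multiple origins are separated by commas.
            allowed_origins=tuple(o.strip() for o in origins.split(",") if o.strip()),
            user_agent=env.get("USER_AGENT", "mcp-web-scraper/0.1.0"),
        )
```

Passing a plain dict instead of os.environ, as in Settings.from_env({"MCP_PORT": "9000"}), makes the parsing easy to exercise in tests.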
Security
- Do not expose this server unauthenticated on the public internet.
- Set MCP_API_KEY for Open WebUI or other clients that support Bearer auth. Do not enable it for Odysseus.
- Scraping is unrestricted by default; only scrape sites you are permitted to access.
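One way an optional Bearer check of this kind could work is sketched below: authentication is skipped when no key is configured, and otherwise the token is compared in constant time. This is an assumption about the general technique, not a description of this server's code, and is_authorized is a hypothetical name.

```python
import hmac
from typing import Optional


def is_authorized(authorization: Optional[str], api_key: str) -> bool:
    """Check an Authorization header value against the configured key."""
    if not api_key:
        # No key configured: auth is disabled (the Odysseus-compatible mode).
        return True
    if not authorization:
        return False
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the key through timing differences.
    return hmac.compare_digest(token.strip().encode("utf-8"), api_key.encode("utf-8"))
```

A client that supports Bearer auth would then send the header Authorization: Bearer followed by the value of MCP_API_KEY.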
Creating a release
CI publishes to GHCR when a GitHub Release is created:
git tag v0.1.0
git push origin v0.1.0
Create a release from tag v0.1.0 on GitHub. The release workflow pushes:
- ghcr.io/<owner>/mcp-web-scraper:0.1.0
- ghcr.io/<owner>/mcp-web-scraper:0.1
- ghcr.io/<owner>/mcp-web-scraper:0
- ghcr.io/<owner>/mcp-web-scraper:latest
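The tag fan-out follows the usual semantic-versioning pattern: the full version, then major.minor, then major, plus latest. The sketch below reproduces that mapping for a tag such as v0.1.0; the function image_tags is hypothetical and says nothing about how the release workflow itself is implemented.

```python
from typing import List


def image_tags(git_tag: str, image: str = "ghcr.io/<owner>/mcp-web-scraper") -> List[str]:
    """Derive the published image references from a vMAJOR.MINOR.PATCH tag."""
    version = git_tag[1:] if git_tag.startswith("v") else git_tag
    major, minor, patch = version.split(".")
    return [
        f"{image}:{major}.{minor}.{patch}",
        f"{image}:{major}.{minor}",
        f"{image}:{major}",
        f"{image}:latest",
    ]
```

Pinning a deployment to the full version tag keeps it stable across releases, while the shorter tags move forward with each matching release.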
Development
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
ruff check src tests
pytest
mcp-web-scraper
License
MIT