agentfetch
An open-source web retrieval MCP server that fetches, crawls, and searches the web, returning clean markdown for AI agents. It integrates with Claude MCP, LangChain, and other frameworks for agentic web access.
README
agentfetch
Open-source web retrieval built for AI agents.
agentfetch is a free, local alternative to Firecrawl, Exa, and Parallel.ai. It fetches any webpage, crawls any site, and searches the web — returning clean markdown that AI agents can consume directly.
Works with LangChain, LlamaIndex, CrewAI, AutoGen, Claude MCP, OpenAI function calling, Gemini, Groq, and plain REST. No vendor lock-in, no API keys required.
Install
Standard
pip install git+https://github.com/SID1ART/agentfetch.git
Cloud notebooks (Colab, Jupyter, Kaggle)
pip install https://github.com/SID1ART/agentfetch/archive/main.zip
With extra integrations
pip install "agentfetch[langchain,llamaindex,crewai] @ git+https://github.com/SID1ART/agentfetch.git"
pip install "agentfetch[search] @ git+https://github.com/SID1ART/agentfetch.git" # adds Google search engine
No PyPI account, no API tokens, no sign-up needed. GitHub is the source.
What makes it different
- Smart Mode Router — detects JavaScript-heavy SPAs (Next.js, Nuxt, React) and falls back to Playwright headless browser automatically. Static pages use direct HTTP.
- 5-layer extraction pipeline — trafilatura → newspaper3k → readability-lxml → BeautifulSoup → plain text. Best-effort extraction from any HTML.
- Never raises exceptions — always returns structured
FetchResultwith confidence scores, error fields, and injection detection. Agents can trust the output. - Information saturation crawling — no arbitrary depth limits. CrawlStopper detects vocabulary saturation and content redundancy, stopping when enough data is gathered.
- Prompt injection firewall — 13 patterns detected and redacted to
[REDACTED BY AGENTFETCH]. - Cloudflare bypass — optional
curl_cffiintegration with 12 TLS fingerprint profiles (Chrome 99–124, Safari 15/17) and auto-rotation. - Robots.txt compliance — optional async parser with caching, crawl-delay, and sitemap discovery.
- Proxy rotation — round-robin or random proxy pools with automatic failure tracking.
- Local LLM extraction — optional Ollama integration for structured data extraction without API costs.
- Redis-backed job queue — horizontal scaling for crawl operations with background workers.
Tools
| Tool | Description |
|---|---|
agent_scrape |
Fetch any URL; auto-detects browser need. Supports ScrapeConfig (wait_for selectors, tag filtering, citation markers, proxies, JA3 profile). |
agent_crawl |
Recursive crawl with information saturation stopping, robots.txt compliance, deduplication. |
agent_search |
Web search via SearXNG, DuckDuckGo, Google, or Bing with optional result scraping. |
agent_extract |
Structured data extraction by JSON schema via Ollama, Anthropic Claude, or CSS fallback. |
agent_status |
Poll crawl job progress (in-memory or Redis). |
Library API
| Function | Description |
|---|---|
smart_fetch(url, config=) |
Fetch a single URL; auto-detects browser need. Returns FetchResult. |
batch_fetch(urls, concurrency=) |
Fetch multiple URLs concurrently. Returns list[FetchResult]. |
search_fetch(query, sources=, max_results=) |
Search and optionally scrape results. Returns SearchResult. |
parallel_search(query, sources=, max_results=) |
Search engine results without scraping. Returns tuple[list[EngineResult], list[str], dict[str, str]]. |
Quickstart
LangChain
from agentfetch.integrations.langchain.tools import AgentFetchTools
tools = AgentFetchTools
# Use with any LangChain agent
MCP (Claude Desktop, Cursor, etc.)
pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch-mcp
# configure in Claude Desktop or any MCP host
REST API
pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch serve
curl -X POST http://localhost:8080/agent_scrape \
-d '{"url": "https://example.com"}'
Python library
import asyncio
from agentfetch import smart_fetch, search_fetch
from agentfetch.core.schema import ScrapeConfig
# Fetch a single URL
result = asyncio.run(smart_fetch(
"https://en.wikipedia.org/wiki/Obsession_(2025_film)",
config=ScrapeConfig(
wait_for=".main-content",
exclude_tags=["nav", "footer"],
citation_links=True,
)
))
print(result.content) # clean markdown
print(result.citations) # [1], [2] URLs
# Search with multiple engines
sr = asyncio.run(search_fetch(
"latest AI news",
sources=["duckduckgo", "google", "bing"],
max_results=5,
))
print(sr.results) # list[FetchResult]
print(sr.errors) # per-engine errors, e.g. {"google": "rate limited (429)"}
print(sr.sources_used) # engines that returned results
All integrations
| Framework | Install | Import |
|---|---|---|
| LangChain | pip install "agentfetch[langchain] @ git+https://github.com/SID1ART/agentfetch.git" |
from agentfetch.integrations.langchain.tools import AgentFetchTools |
| LlamaIndex | pip install "agentfetch[llamaindex] @ git+https://github.com/SID1ART/agentfetch.git" |
from agentfetch.integrations.llamaindex.tools import AgentFetchToolSpec |
| CrewAI | pip install "agentfetch[crewai] @ git+https://github.com/SID1ART/agentfetch.git" |
from agentfetch.integrations.crewai.tools import scrape_tool |
| AutoGen | pip install git+https://github.com/SID1ART/agentfetch.git |
from agentfetch.integrations.openai.tools import get_tools |
| OpenAI / Gemini / Groq | pip install git+https://github.com/SID1ART/agentfetch.git |
from agentfetch.integrations.openai.tools import get_tools |
| Claude MCP | pip install git+https://github.com/SID1ART/agentfetch.git |
agentfetch-mcp |
| Ollama | pip install git+https://github.com/SID1ART/agentfetch.git |
from agentfetch.integrations.ollama.tools import ollama_extract |
| REST | pip install git+https://github.com/SID1ART/agentfetch.git |
agentfetch serve |
Schema reference
ScrapeConfig
| Field | Type | Default | Description |
|---|---|---|---|
wait_for |
str |
None |
CSS selector to wait for before extracting |
include_tags |
list[str] |
None |
Only extract these HTML tags |
exclude_tags |
list[str] |
None |
Skip these HTML tags during extraction |
viewport |
dict |
None |
Browser viewport {width, height} |
js_wait_ms |
int |
0 |
Extra JS wait time in milliseconds |
scrape_links |
bool |
True |
Extract links from page |
max_content_length |
int |
50000 |
Truncate content beyond this length |
citation_links |
bool |
False |
Track citation markers [1], [2] |
proxy |
str |
None |
Proxy URL for this request |
cookies |
list[dict] |
None |
Cookies to include in browser session |
headers |
dict[str,str] |
None |
Custom HTTP headers |
ja3 |
str |
None |
JA3 TLS profile for curl_cffi bypass (e.g. "chrome124") |
FetchResult
| Field | Type | Description |
|---|---|---|
url |
str |
Requested URL |
content |
str |
Extracted markdown content |
title |
str |
Page title |
confidence |
float |
Extraction quality (0.0–1.0) |
content_type |
str |
Detected type (article, blog, product, etc.) |
word_count |
int |
Word count of extracted content |
render_mode |
str |
Renderer used: static, browser, or bypass |
latency_ms |
int |
Total request time in milliseconds |
cached |
bool |
Whether result came from cache |
injection_detected |
bool |
Prompt injection was found and redacted |
links |
list[str] |
Links extracted from the page |
error |
str |
Error message if the fetch failed |
duplicate_of |
str |
URL this content was deduplicated against |
retries |
int |
Number of retries performed |
citations |
list[str] |
Citation URLs when citation_links=True |
robots_allowed |
bool |
Whether robots.txt permitted the fetch |
proxy_used |
str |
Proxy used for this request |
normalized_url |
str |
Normalized version of the requested URL |
SearchConfig
| Field | Type | Default | Description |
|---|---|---|---|
max_results |
int |
5 |
Max results per engine |
sources |
list[str] |
None |
Engines: duckduckgo, google, bing, searxng |
scrape_results |
bool |
True |
Fetch full content of each result |
searxng_url |
str |
"" |
Self-hosted SearXNG instance URL |
SearchResult
| Field | Type | Description |
|---|---|---|
query |
str |
Original search query |
results |
list[FetchResult] |
Search results with extracted content |
source |
str |
Concatenated engine names used |
sources_used |
list[str] |
Engines that returned results |
suggestions |
list[str] |
Search suggestions (if available) |
total_results |
int |
Total deduplicated result count |
errors |
dict[str,str] |
Per-engine error messages (e.g. {"google": "rate limited (429)"}) |
Configuration
Environment variables
| Variable | Default | Description |
|---|---|---|
REDIS_URL |
— | Redis connection for caching + job queue |
SEARXNG_URL |
— | SearXNG instance for search (falls back to DuckDuckGo + Google + Bing) |
ANTHROPIC_API_KEY |
— | For Claude-powered agent_extract |
OLLAMA_URL |
— | Ollama endpoint for local LLM extraction |
OLLAMA_MODEL |
llama3.2 |
Ollama model name |
AGENTFETCH_CACHE_TTL |
3600 |
Cache TTL in seconds |
AGENTFETCH_STATIC_TIMEOUT |
15 |
HTTP fetch timeout (seconds) |
AGENTFETCH_BROWSER_TIMEOUT |
30 |
Playwright browser timeout (seconds) |
AGENTFETCH_MAX_RETRIES |
2 |
Max retries for failed requests |
AGENTFETCH_DOMAIN_DELAY |
0.5 |
Delay between requests to same domain |
AGENTFETCH_ROBOTS_CHECK |
false |
Enable robots.txt compliance |
AGENTFETCH_PROXY_LIST |
— | Comma-separated proxy URLs or JSON array |
AGENTFETCH_PROXY_STRATEGY |
round-robin |
round-robin or random |
AGENTFETCH_COOKIES_FILE |
— | Path to cookies file (Netscape or JSON) |
AGENTFETCH_PORT |
8080 |
API server port |
AGENTFETCH_JA3_PROFILE |
— | JA3 TLS profile override for curl_cffi |
Self-host
docker-compose up -d
# Starts API (port 8080), MCP SSE (port 8081), Redis
# Optional crawl worker:
docker compose --profile worker up -d
Architecture
┌─────────────┐
│ Smart │
│ URL │
│ Router │
└──────┬──────┘
│
┌─────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌──────────────┐ ┌────────────────┐
│ Static │ │ Cloudflare │ │ Playwright │
│ HTTP │ │ bypass │ │ Headless │
│ (httpx) │ │ (curl_cffi) │ │ Browser │
└─────┬──────┘ └──────┬───────┘ └───────┬────────┘
│ │ │
└─────────────────┼────────────────────┘
│
▼
┌─────────────────┐
│ Extraction │
│ Pipeline │
│ trafilatura → │
│ newspaper3k → │
│ readability → │
│ BS4 → plain │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Sanitizer │
│ (13 injection │
│ patterns) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Post-process │
│ • Citations │
│ • Dedup check │
│ • Max length │
│ • Markdown │
└────────┬────────┘
│
▼
┌─────────────────┐
│ FetchResult │
│ Pydantic │
│ response │
└─────────────────┘
Tests
pip install -e ".[all]"
pytest tests/ -v
# 98 tests passing
License
MIT — free for any use, including commercial.
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.