agentfetch

agentfetch

An open-source web retrieval MCP server that fetches, crawls, and searches the web, returning clean markdown for AI agents. It integrates with Claude MCP, LangChain, and other frameworks for agentic web access.

Category
Visit Server

README

agentfetch

Open-source web retrieval built for AI agents.

License: MIT Python 3.10+ Tests

agentfetch is a free, local alternative to Firecrawl, Exa, and Parallel.ai. It fetches any webpage, crawls any site, and searches the web — returning clean markdown that AI agents can consume directly.

Works with LangChain, LlamaIndex, CrewAI, AutoGen, Claude MCP, OpenAI function calling, Gemini, Groq, and plain REST. No vendor lock-in, no API keys required.

Install

Standard

pip install git+https://github.com/SID1ART/agentfetch.git

Cloud notebooks (Colab, Jupyter, Kaggle)

pip install https://github.com/SID1ART/agentfetch/archive/main.zip

With extra integrations

pip install "agentfetch[langchain,llamaindex,crewai] @ git+https://github.com/SID1ART/agentfetch.git"
pip install "agentfetch[search] @ git+https://github.com/SID1ART/agentfetch.git"   # adds Google search engine

No PyPI account, no API tokens, no sign-up needed. GitHub is the source.

What makes it different

  • Smart Mode Router — detects JavaScript-heavy SPAs (Next.js, Nuxt, React) and falls back to Playwright headless browser automatically. Static pages use direct HTTP.
  • 5-layer extraction pipeline — trafilatura → newspaper3k → readability-lxml → BeautifulSoup → plain text. Best-effort extraction from any HTML.
  • Never raises exceptions — always returns structured FetchResult with confidence scores, error fields, and injection detection. Agents can trust the output.
  • Information saturation crawling — no arbitrary depth limits. CrawlStopper detects vocabulary saturation and content redundancy, stopping when enough data is gathered.
  • Prompt injection firewall — 13 patterns detected and redacted to [REDACTED BY AGENTFETCH].
  • Cloudflare bypass — optional curl_cffi integration with 12 TLS fingerprint profiles (Chrome 99–124, Safari 15/17) and auto-rotation.
  • Robots.txt compliance — optional async parser with caching, crawl-delay, and sitemap discovery.
  • Proxy rotation — round-robin or random proxy pools with automatic failure tracking.
  • Local LLM extraction — optional Ollama integration for structured data extraction without API costs.
  • Redis-backed job queue — horizontal scaling for crawl operations with background workers.

Tools

Tool Description
agent_scrape Fetch any URL; auto-detects browser need. Supports ScrapeConfig (wait_for selectors, tag filtering, citation markers, proxies, JA3 profile).
agent_crawl Recursive crawl with information saturation stopping, robots.txt compliance, deduplication.
agent_search Web search via SearXNG, DuckDuckGo, Google, or Bing with optional result scraping.
agent_extract Structured data extraction by JSON schema via Ollama, Anthropic Claude, or CSS fallback.
agent_status Poll crawl job progress (in-memory or Redis).

Library API

Function Description
smart_fetch(url, config=) Fetch a single URL; auto-detects browser need. Returns FetchResult.
batch_fetch(urls, concurrency=) Fetch multiple URLs concurrently. Returns list[FetchResult].
search_fetch(query, sources=, max_results=) Search and optionally scrape results. Returns SearchResult.
parallel_search(query, sources=, max_results=) Search engine results without scraping. Returns tuple[list[EngineResult], list[str], dict[str, str]].

Quickstart

LangChain

from agentfetch.integrations.langchain.tools import AgentFetchTools
tools = AgentFetchTools
# Use with any LangChain agent

MCP (Claude Desktop, Cursor, etc.)

pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch-mcp
# configure in Claude Desktop or any MCP host

REST API

pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch serve
curl -X POST http://localhost:8080/agent_scrape \
  -d '{"url": "https://example.com"}'

Python library

import asyncio
from agentfetch import smart_fetch, search_fetch
from agentfetch.core.schema import ScrapeConfig

# Fetch a single URL
result = asyncio.run(smart_fetch(
    "https://en.wikipedia.org/wiki/Obsession_(2025_film)",
    config=ScrapeConfig(
        wait_for=".main-content",
        exclude_tags=["nav", "footer"],
        citation_links=True,
    )
))
print(result.content)  # clean markdown
print(result.citations)  # [1], [2] URLs

# Search with multiple engines
sr = asyncio.run(search_fetch(
    "latest AI news",
    sources=["duckduckgo", "google", "bing"],
    max_results=5,
))
print(sr.results)      # list[FetchResult]
print(sr.errors)       # per-engine errors, e.g. {"google": "rate limited (429)"}
print(sr.sources_used) # engines that returned results

All integrations

Framework Install Import
LangChain pip install "agentfetch[langchain] @ git+https://github.com/SID1ART/agentfetch.git" from agentfetch.integrations.langchain.tools import AgentFetchTools
LlamaIndex pip install "agentfetch[llamaindex] @ git+https://github.com/SID1ART/agentfetch.git" from agentfetch.integrations.llamaindex.tools import AgentFetchToolSpec
CrewAI pip install "agentfetch[crewai] @ git+https://github.com/SID1ART/agentfetch.git" from agentfetch.integrations.crewai.tools import scrape_tool
AutoGen pip install git+https://github.com/SID1ART/agentfetch.git from agentfetch.integrations.openai.tools import get_tools
OpenAI / Gemini / Groq pip install git+https://github.com/SID1ART/agentfetch.git from agentfetch.integrations.openai.tools import get_tools
Claude MCP pip install git+https://github.com/SID1ART/agentfetch.git agentfetch-mcp
Ollama pip install git+https://github.com/SID1ART/agentfetch.git from agentfetch.integrations.ollama.tools import ollama_extract
REST pip install git+https://github.com/SID1ART/agentfetch.git agentfetch serve

Schema reference

ScrapeConfig

Field Type Default Description
wait_for str None CSS selector to wait for before extracting
include_tags list[str] None Only extract these HTML tags
exclude_tags list[str] None Skip these HTML tags during extraction
viewport dict None Browser viewport {width, height}
js_wait_ms int 0 Extra JS wait time in milliseconds
scrape_links bool True Extract links from page
max_content_length int 50000 Truncate content beyond this length
citation_links bool False Track citation markers [1], [2]
proxy str None Proxy URL for this request
cookies list[dict] None Cookies to include in browser session
headers dict[str,str] None Custom HTTP headers
ja3 str None JA3 TLS profile for curl_cffi bypass (e.g. "chrome124")

FetchResult

Field Type Description
url str Requested URL
content str Extracted markdown content
title str Page title
confidence float Extraction quality (0.0–1.0)
content_type str Detected type (article, blog, product, etc.)
word_count int Word count of extracted content
render_mode str Renderer used: static, browser, or bypass
latency_ms int Total request time in milliseconds
cached bool Whether result came from cache
injection_detected bool Prompt injection was found and redacted
links list[str] Links extracted from the page
error str Error message if the fetch failed
duplicate_of str URL this content was deduplicated against
retries int Number of retries performed
citations list[str] Citation URLs when citation_links=True
robots_allowed bool Whether robots.txt permitted the fetch
proxy_used str Proxy used for this request
normalized_url str Normalized version of the requested URL

SearchConfig

Field Type Default Description
max_results int 5 Max results per engine
sources list[str] None Engines: duckduckgo, google, bing, searxng
scrape_results bool True Fetch full content of each result
searxng_url str "" Self-hosted SearXNG instance URL

SearchResult

Field Type Description
query str Original search query
results list[FetchResult] Search results with extracted content
source str Concatenated engine names used
sources_used list[str] Engines that returned results
suggestions list[str] Search suggestions (if available)
total_results int Total deduplicated result count
errors dict[str,str] Per-engine error messages (e.g. {"google": "rate limited (429)"})

Configuration

Environment variables

Variable Default Description
REDIS_URL Redis connection for caching + job queue
SEARXNG_URL SearXNG instance for search (falls back to DuckDuckGo + Google + Bing)
ANTHROPIC_API_KEY For Claude-powered agent_extract
OLLAMA_URL Ollama endpoint for local LLM extraction
OLLAMA_MODEL llama3.2 Ollama model name
AGENTFETCH_CACHE_TTL 3600 Cache TTL in seconds
AGENTFETCH_STATIC_TIMEOUT 15 HTTP fetch timeout (seconds)
AGENTFETCH_BROWSER_TIMEOUT 30 Playwright browser timeout (seconds)
AGENTFETCH_MAX_RETRIES 2 Max retries for failed requests
AGENTFETCH_DOMAIN_DELAY 0.5 Delay between requests to same domain
AGENTFETCH_ROBOTS_CHECK false Enable robots.txt compliance
AGENTFETCH_PROXY_LIST Comma-separated proxy URLs or JSON array
AGENTFETCH_PROXY_STRATEGY round-robin round-robin or random
AGENTFETCH_COOKIES_FILE Path to cookies file (Netscape or JSON)
AGENTFETCH_PORT 8080 API server port
AGENTFETCH_JA3_PROFILE JA3 TLS profile override for curl_cffi

Self-host

docker-compose up -d
# Starts API (port 8080), MCP SSE (port 8081), Redis
# Optional crawl worker:
docker compose --profile worker up -d

Architecture

                         ┌─────────────┐
                         │   Smart     │
                         │   URL       │
                         │   Router    │
                         └──────┬──────┘
                                │
              ┌─────────────────┼──────────────────┐
              │                 │                   │
              ▼                 ▼                   ▼
      ┌────────────┐   ┌──────────────┐   ┌────────────────┐
      │  Static    │   │  Cloudflare  │   │   Playwright   │
      │  HTTP      │   │  bypass      │   │   Headless     │
      │  (httpx)   │   │  (curl_cffi) │   │   Browser      │
      └─────┬──────┘   └──────┬───────┘   └───────┬────────┘
            │                 │                    │
            └─────────────────┼────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Extraction     │
                    │  Pipeline       │
                    │  trafilatura →  │
                    │  newspaper3k →  │
                    │  readability →  │
                    │  BS4 → plain    │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Sanitizer      │
                    │  (13 injection  │
                    │   patterns)     │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Post-process   │
                    │  • Citations    │
                    │  • Dedup check  │
                    │  • Max length   │
                    │  • Markdown     │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │   FetchResult   │
                    │   Pydantic      │
                    │   response      │
                    └─────────────────┘

Tests

pip install -e ".[all]"
pytest tests/ -v
# 98 tests passing

License

MIT — free for any use, including commercial.

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.

Official
Featured
Local
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
VeyraX MCP

VeyraX MCP

Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.

Official
Featured
Local
graphlit-mcp-server

graphlit-mcp-server

The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.

Official
Featured
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
E2B

E2B

Using MCP to run code via e2b.

Official
Featured
Neon Database

Neon Database

MCP server for interacting with Neon Management API and databases

Official
Featured
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Qdrant Server

Qdrant Server

This repository is an example of how to create a MCP server for Qdrant, a vector search engine.

Official
Featured