spectrawl

Unified web layer for AI agents. Search across 8 engines, stealth browse with Camoufox/Playwright, manage cookie auth, and act on 24 platforms.

Spectrawl

The unified web layer for AI agents. Search, browse, authenticate, and act on platforms — one package, self-hosted.

5,000 free searches/month via Gemini Grounded Search. Full page scraping, stealth browsing, 24 platform adapters.

What It Does

AI agents need to interact with the web — searching, browsing pages, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.

npm install spectrawl

How It Works

Spectrawl searches via Gemini Grounded Search (Google-quality results), scrapes the top pages for full content, and returns everything to your agent. Your agent's LLM reads the actual sources and forms its own answer — no pre-chewed summaries.

Quick Start

npm install spectrawl
export GEMINI_API_KEY=your-free-key  # Get one at aistudio.google.com

const { Spectrawl } = require('spectrawl')
const web = new Spectrawl()

// CommonJS has no top-level await, so the calls run inside an async function
async function main() {
  // Deep search: returns sources for your agent/LLM to process
  const result = await web.deepSearch('how to build an MCP server in Node.js')
  console.log(result.sources)   // [{ title, url, content, score }]

  // With AI summary (opt-in, uses an extra Gemini call)
  const withAnswer = await web.deepSearch('query', { summarize: true })
  console.log(withAnswer.answer)  // AI-generated answer with [1] [2] citations

  // Fast mode: snippets only, skip scraping
  const fast = await web.deepSearch('query', { mode: 'fast' })

  // Basic search: raw results
  const basic = await web.search('query')
}

main().catch(console.error)

Why no summary by default? Your agent already has an LLM. If we summarize AND your agent summarizes, you're paying two LLMs for one answer. We return rich sources — your agent does the rest.
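
Since the sources come back unsummarized, the agent has to fold them into its own prompt. Here is a minimal sketch of that step, assuming the { title, url, content, score } shape shown in the Quick Start. The buildContext helper and its maxChars budget are hypothetical and not part of Spectrawl.

```javascript
// Hypothetical helper (not a Spectrawl API): turn deepSearch-style sources
// into a numbered context block an agent can paste into its own LLM prompt.
function buildContext(sources, maxChars = 2000) {
  return [...sources]
    .sort((a, b) => b.score - a.score)            // highest-scored source first
    .map((s, i) => {
      const body = s.content.slice(0, maxChars);  // cap each source's length
      return `[${i + 1}] ${s.title} (${s.url})\n${body}`;
    })
    .join('\n\n');
}

// Stand-in data with the same shape as result.sources
const sample = [
  { title: 'B', url: 'https://b.example', content: 'second', score: 0.4 },
  { title: 'A', url: 'https://a.example', content: 'first', score: 0.9 },
];
console.log(buildContext(sample));
```

Numbering the sources lets the agent's LLM cite them as [1], [2] in the same style as the opt-in summary.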

Spectrawl vs Tavily

Different tools for different needs.

Feature             Tavily                  Spectrawl
Speed               ~2s                     ~6-10s
Free tier           1,000/month             5,000/month
Returns             Snippets + AI answer    Full page content + snippets
Self-hosted         No                      Yes
Stealth browsing    No                      Yes (Camoufox + Playwright)
Platform posting    No                      24 adapters
Auth management     No                      Cookie store + auto-refresh
Cached repeats      No                      <1ms

Tavily is fast and simple — great for agents that need quick answers. Spectrawl returns richer data and does more (browse, auth, post) — but it's slower. Choose based on your use case.

Search

Default cascade: Gemini Grounded → Tavily → Brave

Gemini Grounded Search gives you Google-quality results through the Gemini API. Free tier: 5,000 grounded queries/month.

Engine            Free Tier      Key Required      Default
Gemini Grounded   5,000/month    GEMINI_API_KEY    ✅ Primary
Tavily            1,000/month    TAVILY_API_KEY    ✅ 1st fallback
Brave             2,000/month    BRAVE_API_KEY     ✅ 2nd fallback
DuckDuckGo        Unlimited      None              Available
Bing              Unlimited      None              Available
Serper            2,500 trial    SERPER_API_KEY    Available
Google CSE        100/day        GOOGLE_CSE_KEY    Available
Jina Reader       Unlimited      None              Available
SearXNG           Unlimited      Self-hosted       Available
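
The cascade idea can be sketched as a loop that tries each engine in order and stops at the first one that returns results. This is an illustration only: the engine objects are stand-ins, and the README does not say what triggers a fallback in Spectrawl. The sketch assumes an error or an empty result set does.

```javascript
// Hypothetical fallback cascade: try engines in order, return the first
// non-empty result set, and collect the reason each earlier engine was skipped.
async function searchWithCascade(engines, query) {
  const errors = [];
  for (const engine of engines) {
    try {
      const results = await engine.search(query);
      if (results.length > 0) return { engine: engine.name, results };
      errors.push(`${engine.name}: no results`);
    } catch (err) {
      errors.push(`${engine.name}: ${err.message}`); // e.g. quota exhausted
    }
  }
  throw new Error(`all engines failed: ${errors.join('; ')}`);
}

// Stand-in engines: the first always fails, the second answers.
const engines = [
  { name: 'primary', search: async () => { throw new Error('quota exhausted'); } },
  { name: 'fallback', search: async (q) => [{ title: q, url: 'https://example.com' }] },
];
searchWithCascade(engines, 'mcp server').then((r) => console.log(r.engine)); // logs "fallback"
```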

Deep Search Pipeline

Query → Gemini Grounded + DDG (parallel)
  → Merge & deduplicate (12-19 results)
  → Source quality ranking (boost GitHub/SO/Reddit, penalize SEO spam)
  → Parallel scraping (Jina → readability → Playwright fallback)
  → Returns sources to your agent (AI summary opt-in with summarize: true)
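
The merge and deduplicate step of the pipeline above might look like the following sketch. How Spectrawl actually normalizes URLs is not documented here. This version assumes that dropping the fragment, a leading www., and a trailing slash is enough to spot duplicates. Both function names are hypothetical.

```javascript
// Normalize a URL so trivially different forms compare equal.
function normalizeUrl(raw) {
  const u = new URL(raw);
  u.hash = '';                                    // drop #fragment
  u.hostname = u.hostname.replace(/^www\./, '');  // www.example.com -> example.com
  const s = u.toString();
  return s.endsWith('/') ? s.slice(0, -1) : s;    // drop trailing slash
}

// Merge result lists from several engines, keeping the first hit per URL.
function mergeResults(...lists) {
  const seen = new Map();
  for (const list of lists) {
    for (const item of list) {
      const key = normalizeUrl(item.url);
      if (!seen.has(key)) seen.set(key, item);
    }
  }
  return [...seen.values()];
}

const merged = mergeResults(
  [{ url: 'https://www.example.com/a/#top', title: 'from engine 1' }],
  [{ url: 'https://example.com/a', title: 'from engine 2' }]
);
console.log(merged.length); // 1
```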

What you get without any keys

Without any keys you get DDG-only search with raw results and no AI answer. This works from home IPs, but DDG rate-limits datacenter IPs, so a free Gemini key is the recommended minimum.

Browse

Stealth browsing with anti-detection. Three tiers (auto-detected):

  1. playwright-extra + stealth plugin: default, works immediately
  2. Camoufox binary: engine-level anti-fingerprint (npx spectrawl install-stealth)
  3. Remote Camoufox: for existing deployments

const page = await web.browse('https://example.com')
console.log(page.content)       // extracted text/markdown
console.log(page.screenshot)    // PNG buffer (if requested)

Auto-fallback: if Jina and readability return too little content (<200 chars), Spectrawl renders the page with Playwright and extracts from the rendered DOM. Tavily has no equivalent step, so it fails on JS-heavy pages.
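
The fallback rule can be pictured as a chain of extractors, where each one is accepted only if it clears the length threshold. The sketch below uses the 200-character threshold quoted above. The extractor functions are stand-ins, and returning the longest attempt when nothing clears the bar is an assumption rather than documented behavior.

```javascript
const MIN_CHARS = 200; // threshold quoted above

// Try extractors in order (cheapest first); accept the first long-enough text.
async function extractWithFallback(url, extractors, minChars = MIN_CHARS) {
  let best = '';
  for (const extract of extractors) {
    let text = '';
    try {
      text = (await extract(url)) || '';
    } catch {
      text = ''; // treat a failed extractor like an empty one
    }
    if (text.length >= minChars) return text;
    if (text.length > best.length) best = text;
  }
  return best; // nothing reached the threshold; return the longest attempt
}

// Stand-ins: a cheap extractor that gets a stub, and a "rendered" one that succeeds.
const cheap = async () => 'Loading...';
const rendered = async () => 'x'.repeat(500);
extractWithFallback('https://example.com', [cheap, rendered])
  .then((text) => console.log(text.length)); // logs 500
```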

Auth

Persistent cookie storage (SQLite), multi-account management, automatic expiry detection.

// Add account
await web.auth.add('x', { account: '@myhandle', method: 'cookie', cookies })

// Check health
const accounts = await web.auth.getStatus()
// [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]

A cookie-refresh cron job fires cookie_expiring and cookie_expired events, so you can act before accounts go stale.
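
To illustrate the two event names, here is a self-contained sketch built on Node's EventEmitter. It does not show how Spectrawl exposes these events. The checkAccounts function, the 24-hour warning window, and the payload shape are all assumptions made for the example.

```javascript
const { EventEmitter } = require('node:events');

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical checker: classify each account by time left on its cookies.
// Past expiry -> cookie_expired; inside the warning window -> cookie_expiring.
function checkAccounts(emitter, accounts, now = Date.now(), warnMs = DAY_MS) {
  for (const acct of accounts) {
    const remaining = new Date(acct.expiresAt).getTime() - now;
    if (remaining <= 0) emitter.emit('cookie_expired', acct);
    else if (remaining <= warnMs) emitter.emit('cookie_expiring', acct);
  }
}

const emitter = new EventEmitter();
emitter.on('cookie_expiring', (a) => console.log(`refresh soon: ${a.platform} ${a.account}`));
emitter.on('cookie_expired', (a) => console.log(`re-login needed: ${a.platform} ${a.account}`));

// A cron job would call this on a schedule with the stored accounts.
checkAccounts(emitter, [
  { platform: 'x', account: '@myhandle', expiresAt: new Date(Date.now() + 3600 * 1000).toISOString() },
]);
```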

Act — 24 Platform Adapters

Post to 24+ platforms with one API:

await web.act('github', 'create-issue', { repo: 'user/repo', title: 'Bug report', body: '...' })
await web.act('reddit', 'post', { subreddit: 'node', title: '...', text: '...' })
await web.act('devto', 'post', { title: '...', body: '...', tags: ['ai'] })
await web.act('huggingface', 'create-repo', { name: 'my-model', type: 'model' })

Live tested: GitHub ✅, Reddit ✅, Dev.to ✅, HuggingFace ✅, X (reads) ✅

Platform          Auth Method             Actions
X/Twitter         Cookie + OAuth 1.0a     post
Reddit            Cookie API              post, comment, delete
Dev.to            REST API key            post, update
Hashnode          GraphQL API             post
LinkedIn          Cookie API (Voyager)    post
IndieHackers      Browser automation      post, comment
Medium            REST API                post
GitHub            REST v3                 repo, file, issue, release
Discord           Bot API                 send, thread
Product Hunt      GraphQL v2              launch, comment
Hacker News       Cookie API              submit, comment
YouTube           Data API v3             comment
Quora             Browser automation      answer
HuggingFace       Hub API                 repo, model card, upload
BetaList          REST API                submit
14 Directories    Generic adapter         submit

Built-in rate limiting, content dedup (MD5, 24h window), and dead letter queue for retries.
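
As an illustration of MD5-based content dedup with a 24-hour window, the sketch below hashes each post and rejects a repeat seen within the window. The ContentDeduper class is hypothetical. Keying the hash by platform and trimming whitespace are assumptions, and a real implementation would persist the map rather than keep it in memory.

```javascript
const crypto = require('node:crypto');

const WINDOW_MS = 24 * 60 * 60 * 1000; // 24h window, as stated above

class ContentDeduper {
  constructor(windowMs = WINDOW_MS) {
    this.windowMs = windowMs;
    this.seen = new Map(); // md5 hex -> timestamp of the last accepted post
  }

  // Returns true if the same content was accepted within the window.
  isDuplicate(platform, text, now = Date.now()) {
    const hash = crypto
      .createHash('md5')
      .update(`${platform}\n${text.trim()}`)
      .digest('hex');
    const last = this.seen.get(hash);
    if (last !== undefined && now - last < this.windowMs) return true;
    this.seen.set(hash, now); // first sighting, or the window has passed
    return false;
  }
}

const deduper = new ContentDeduper();
console.log(deduper.isDuplicate('reddit', 'Hello world')); // false
console.log(deduper.isDuplicate('reddit', 'Hello world')); // true
```

MD5 is fine here because the hash only detects accidental repeats. It is not being used as a security measure.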

Source Quality Ranking

Spectrawl ranks results by domain trust — something Tavily doesn't do:

  • Boosted: GitHub, StackOverflow, HN, Reddit, MDN, arxiv, Wikipedia
  • Penalized: SEO farms, thin content sites, tag/category pages
  • Customizable: bring your own domain weights

const web = new Spectrawl({
  sourceRanker: {
    boost: ['github.com', 'news.ycombinator.com'],
    block: ['spamsite.com']
  }
})
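
To make the effect of boost and block lists concrete, here is a standalone sketch of domain-based re-ranking. It is not Spectrawl's ranker. The rankSources function and the 1.5 boostFactor are hypothetical, and the real weights are not documented in this README.

```javascript
// Hypothetical re-ranker: drop blocked domains, multiply the score of boosted
// domains, then sort by score. Subdomains match their parent domain.
function rankSources(results, { boost = [], block = [], boostFactor = 1.5 } = {}) {
  const matches = (host, domain) => host === domain || host.endsWith(`.${domain}`);
  return results
    .map((r) => ({ ...r, host: new URL(r.url).hostname.replace(/^www\./, '') }))
    .filter((r) => !block.some((d) => matches(r.host, d)))
    .map(({ host, ...r }) => ({
      ...r,
      score: boost.some((d) => matches(host, d)) ? r.score * boostFactor : r.score,
    }))
    .sort((a, b) => b.score - a.score);
}

const ranked = rankSources(
  [
    { url: 'https://spamsite.com/top-10', score: 0.9 },
    { url: 'https://example.com/post', score: 0.6 },
    { url: 'https://github.com/user/repo', score: 0.5 },
  ],
  { boost: ['github.com'], block: ['spamsite.com'] }
);
console.log(ranked.map((r) => r.url));
// [ 'https://github.com/user/repo', 'https://example.com/post' ]
```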

HTTP Server

npx spectrawl serve --port 3900

POST /search   { "query": "...", "summarize": true }
POST /browse   { "url": "...", "screenshot": true }
POST /act      { "platform": "github", "action": "create-issue", ... }
GET  /status   — auth account health
GET  /health   — server health
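
A client for the /search endpoint might be sketched as follows, using the global fetch in Node 18+. The JSON Content-Type header and localhost:3900 base URL are assumptions. The response shape is not documented above, so the sketch returns the parsed JSON as-is. Both function names are hypothetical.

```javascript
// Build the fetch() arguments for POST /search with the body fields listed above.
function buildSearchRequest(baseUrl, query, { summarize = false } = {}) {
  return {
    url: new URL('/search', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }, // assumed, not documented
      body: JSON.stringify({ query, summarize }),
    },
  };
}

// Thin wrapper around fetch; throws on non-2xx responses.
async function postSearch(baseUrl, query, options) {
  const { url, init } = buildSearchRequest(baseUrl, query, options);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`search failed: HTTP ${res.status}`);
  return res.json();
}

// With the server from `npx spectrawl serve --port 3900` running:
// postSearch('http://localhost:3900', 'mcp server', { summarize: true }).then(console.log);
```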

MCP Server

Works with any MCP-compatible agent (Claude, Cursor, OpenClaw, LangChain):

npx spectrawl mcp

5 tools: web_search, web_browse, web_act, web_auth, web_status.

CLI

npx spectrawl init              # create spectrawl.json
npx spectrawl search "query"    # search from terminal
npx spectrawl status            # check auth health
npx spectrawl serve             # start HTTP server
npx spectrawl mcp               # start MCP server
npx spectrawl install-stealth   # download Camoufox browser

Configuration

spectrawl.json:

{
  "search": {
    "cascade": ["gemini-grounded", "tavily", "brave"],
    "scrapeTop": 5
  },
  "cache": {
    "searchTtl": 3600,
    "scrapeTtl": 86400
  },
  "rateLimit": {
    "x": { "postsPerHour": 3 },
    "reddit": { "postsPerHour": 5 }
  }
}
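
One way to read the rateLimit block is as a per-platform cap on posts in any rolling hour. The sketch below implements that reading with a sliding window. Whether Spectrawl uses a sliding window, a fixed window, or something else is not stated, and the HourlyLimiter class is hypothetical.

```javascript
const HOUR_MS = 3600 * 1000;

class HourlyLimiter {
  constructor(limits) {
    this.limits = limits;      // e.g. { x: { postsPerHour: 3 } }
    this.history = new Map();  // platform -> timestamps of recent posts
  }

  // Returns true and records the post if the platform is under its cap.
  tryAcquire(platform, now = Date.now()) {
    const limit = this.limits[platform]?.postsPerHour ?? Infinity;
    const recent = (this.history.get(platform) ?? []).filter((t) => now - t < HOUR_MS);
    const allowed = recent.length < limit;
    if (allowed) recent.push(now);
    this.history.set(platform, recent); // also prunes entries older than an hour
    return allowed;
  }
}

const limiter = new HourlyLimiter({ x: { postsPerHour: 3 }, reddit: { postsPerHour: 5 } });
console.log(limiter.tryAcquire('x')); // true
```

Platforms without an entry are treated as unlimited in this sketch. A stricter default may be preferable in practice.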

Environment Variables

GEMINI_API_KEY      Free; primary search + summarization (aistudio.google.com)
TAVILY_API_KEY      Tavily, 1st fallback (1,000 free/month)
BRAVE_API_KEY       Brave Search, 2nd fallback (2,000 free/month)
SERPER_API_KEY      Serper.dev (2,500 trial queries)
GOOGLE_CSE_KEY      Google CSE (100 free/day)
GITHUB_TOKEN        For GitHub adapter
DEVTO_API_KEY       For Dev.to adapter
HF_TOKEN            For HuggingFace adapter
OPENAI_API_KEY      Alternative LLM for summarization
ANTHROPIC_API_KEY   Alternative LLM for summarization

License

MIT

Part of xanOS

Spectrawl is the web layer for xanOS — the autonomous content engine. Use it standalone or as part of the full stack.
