MCP Server Steel Scraper
An MCP server that wraps the steel-dev API to enable AI agents to visit websites with browser automation, supporting both stateless scraping and stateful interactive sessions.
A simple Model Context Protocol (MCP) server that wraps the steel-dev API for visiting websites with browser automation.
Quick Start
1. **Install the package:**

       npm install -g @jharding_npm/mcp-server-steel-scraper

2. **Add to your MCP client configuration:**

       {
         "mcpServers": {
           "steel-scraper": {
             "command": "npx",
             "args": ["@jharding_npm/mcp-server-steel-scraper", "--mode=both"],
             "env": {
               "STEEL_API_URL": "http://localhost:3000"
             }
           }
         }
       }

3. **Start using the stateless `visit_with_browser` tool, or the stateful interactive tools.**
## Features
- **Dual Modes**: Run stateless scraping, stateful interaction, or both via `--mode=stateless|stateful|both` (default: `both`)
- **Stateless Tool**: `visit_with_browser` - Visit websites using steel-dev API
- **Stateful Tools**: Create sessions and interact with pages (navigate, click, type, scroll, snapshot)
- **Flexible Return Types**: HTML, markdown, readability, or cleaned HTML
- **Local/Remote Support**: Works with local or remote steel-dev instances
- **Browser Automation**: Screenshot capture, PDF generation, proxy support
- **Smart Length Management**: Single `maxLength` parameter with intelligent defaults and automatic content/metadata split
- **Clean Output by Default**: Minimal metadata output perfect for 7B models and summarization
- **Verbose Mode**: Optional full metadata when detailed information is needed
- **TypeScript**: Fully typed implementation
## Installation
### Option 1: NPM Package (Recommended)
Install the package globally to use it with npx:
```bash
npm install -g @jharding_npm/mcp-server-steel-scraper
```
Or use it directly with npx without installing:
npx @jharding_npm/mcp-server-steel-scraper
### Option 2: Local Development
1. Clone this repository:

       git clone <repository-url>
       cd mcp-server-steel-scraper

2. Install dependencies:

       npm install

3. Build the project:

       npm run build

Configuration
The server uses environment variables for configuration:
- `STEEL_API_URL`: The steel-dev API endpoint (default: `http://localhost:3000`)
- `STEEL_TIMEOUT`: Request timeout in milliseconds (default: `30000`)
- `STEEL_RETRIES`: Number of retry attempts (default: `3`)
- `STEEL_LOCAL`: Set to `true` when using a local Steel instance for stateful sessions
- `STEEL_BASE_URL`: Base URL for the Steel Sessions API (default: `https://api.steel.dev`, or `http://localhost:3000` when `STEEL_LOCAL=true`)
- `STEEL_API_KEY`: Required for cloud mode stateful sessions
- `STEEL_SESSION_TIMEOUT_MS`: Session timeout in milliseconds (default: `900000`)
- `STEEL_GLOBAL_WAIT_SECONDS`: Optional delay after each stateful action (default: `0`)
- `STEEL_IDLE_TIMEOUT_MS`: Auto-release idle sessions after this many milliseconds (default: `600000`, set to `0` to disable)
Copy env.example to .env and modify as needed:
cp env.example .env
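As a rough illustration of how these variables and their documented defaults fit together, here is a minimal TypeScript sketch. The `SteelConfig` shape, the `loadConfig` and `intOr` helpers, and the fallback on unparsable numbers are assumptions made for this example; they are not taken from `src/config.ts`.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface SteelConfig {
  apiUrl: string;
  timeoutMs: number;
  retries: number;
  local: boolean;
  baseUrl: string;
  apiKey: string | undefined;
  sessionTimeoutMs: number;
  globalWaitSeconds: number;
  idleTimeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Parse a non-negative integer, falling back to the documented default
// when the variable is unset or not a number (assumed behavior).
function intOr(value: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(value ?? "", 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

function loadConfig(env: Env): SteelConfig {
  const local = env["STEEL_LOCAL"] === "true";
  return {
    apiUrl: env["STEEL_API_URL"] ?? "http://localhost:3000",
    timeoutMs: intOr(env["STEEL_TIMEOUT"], 30000),
    retries: intOr(env["STEEL_RETRIES"], 3),
    local,
    // The Sessions API base URL depends on STEEL_LOCAL unless set explicitly.
    baseUrl:
      env["STEEL_BASE_URL"] ??
      (local ? "http://localhost:3000" : "https://api.steel.dev"),
    apiKey: env["STEEL_API_KEY"],
    sessionTimeoutMs: intOr(env["STEEL_SESSION_TIMEOUT_MS"], 900000),
    globalWaitSeconds: intOr(env["STEEL_GLOBAL_WAIT_SECONDS"], 0),
    idleTimeoutMs: intOr(env["STEEL_IDLE_TIMEOUT_MS"], 600000),
  };
}
```

Passing the environment in as an argument (for example `loadConfig(process.env)`) keeps the sketch easy to exercise without touching the real process environment.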
Usage
Running the Server
# Development mode
npm run dev
# Auto-rebuild on changes (recommended for npm link workflows)
npm run build:watch
# Production mode
npm start
# Only stateless scraping tools
npm start -- --mode=stateless
# Only stateful interactive tools
npm start -- --mode=stateful
# Both tool sets (default)
npm start -- --mode=both
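The mode flag shown in the commands above could be parsed along these lines. This is a sketch only: `parseMode` is a hypothetical helper, and throwing on an unknown value is an assumption, since the README does not say how the server treats invalid modes.

```typescript
type Mode = "stateless" | "stateful" | "both";

const MODES: readonly Mode[] = ["stateless", "stateful", "both"];

// Read `--mode=<value>` from an argument list; default to "both" when absent.
function parseMode(argv: string[]): Mode {
  const flag = argv.find((arg) => arg.startsWith("--mode="));
  if (flag === undefined) return "both";

  const value = flag.slice("--mode=".length);
  const match = MODES.find((mode) => mode === value);
  if (match === undefined) {
    // Assumed behavior: reject unknown modes instead of silently defaulting.
    throw new Error(`Unknown mode "${value}"; expected stateless, stateful, or both`);
  }
  return match;
}
```

In a Node entry point this would typically be called as `parseMode(process.argv.slice(2))`.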
MCP Client Configuration
Add this server to your MCP client configuration. Here are examples for popular LLM clients:
For Claude Desktop / Cline / Other MCP Clients (NPM Package)
{
"mcpServers": {
"steel-scraper": {
"command": "npx",
"args": ["@jharding_npm/mcp-server-steel-scraper", "--mode=stateless"],
"env": {
"STEEL_API_URL": "http://localhost:3000"
}
}
}
}
To expose the stateful interactive tools, replace `--mode=stateless` in the `args` array with `--mode=stateful` or `--mode=both`.
For Continue.dev (NPM Package)
{
"mcpServers": {
"steel-scraper": {
"command": "npx",
"args": ["@jharding_npm/mcp-server-steel-scraper", "--mode=stateless"],
"env": {
"STEEL_API_URL": "http://localhost:3000"
}
}
}
}
For Cursor IDE (NPM Package)
{
"mcpServers": {
"steel-scraper": {
"command": "npx",
"args": ["@jharding_npm/mcp-server-steel-scraper", "--mode=stateless"],
"env": {
"STEEL_API_URL": "http://localhost:3000"
}
}
}
}
For Remote Steel-dev Instance (NPM Package)
{
"mcpServers": {
"steel-scraper": {
"command": "npx",
"args": ["@jharding_npm/mcp-server-steel-scraper", "--mode=stateless"],
"env": {
"STEEL_API_URL": "https://your-steel-dev-instance.com"
}
}
}
}
Alternative: Using Global Installation
If you've installed the package globally with npm install -g @jharding_npm/mcp-server-steel-scraper, you can use:
{
"mcpServers": {
"steel-scraper": {
"command": "mcp-server-steel-scraper",
"env": {
"STEEL_API_URL": "http://localhost:3000"
}
}
}
}
For Local Development (using absolute path)
{
"mcpServers": {
"steel-scraper": {
"command": "node",
"args": ["/path/to/mcp-server-steel-scraper/dist/index.js"],
"env": {
"STEEL_API_URL": "http://localhost:3000"
}
}
}
}
Tool Usage
In stateless mode, the server provides one tool: `visit_with_browser`. The stateful tools are described under Stateful Interactive Tools.
Parameters
- `url` (required): The URL to visit
- `format` (optional): Content formats to extract: `["html"]` for raw HTML source (may be very large), `["markdown"]` for clean formatted text converted from HTML (recommended for reading), `["readability"]` for Mozilla Readability format, `["cleaned_html"]` for cleaned HTML. You can request multiple formats (default: `["markdown"]`)
- `screenshot` (optional): Take a screenshot of the page (returns base64 encoded image) (default: `false`)
- `pdf` (optional): Generate a PDF of the page (returns base64 encoded PDF) (default: `false`)
- `proxyUrl` (optional): Proxy URL to use for the request (e.g., `"http://proxy:port"`)
- `delay` (optional): Delay in seconds to wait after page load before scraping (default: `0`)
- `logUrl` (optional): URL to send logs to for debugging purposes
- `maxLength` (optional): Maximum characters to return. Smart defaults: markdown=8000, readability=10000, html=15000, cleaned_html=12000. For markdown, automatically reserves space for metadata
- `verboseMode` (optional): Return full metadata instead of clean content-focused output (default: `false`). Use when you need detailed visit information
Example Usage
// Basic website visit
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://example.com"
}
}
// Advanced visit with multiple formats
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://example.com",
"format": ["markdown", "html"],
"screenshot": true,
"delay": 2
}
}
// Simple visit with smart defaults (perfect for 7B models)
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://example.com",
"format": ["markdown"]
}
}
// Custom length limit (automatically handles content vs metadata split)
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://en.wikipedia.org/wiki/Long_Article",
"format": ["markdown"],
"maxLength": 5000
}
}
// Verbose mode when you need detailed visit information
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://example.com",
"format": ["markdown"],
"maxLength": 8000,
"verboseMode": true
}
}
// With proxy and PDF generation
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://example.com",
"format": ["readability"],
"pdf": true,
"proxyUrl": "http://proxy:8080"
}
}
Stateful Interactive Tools
When running with --mode=stateful or --mode=both, the server exposes stateful tools that let the LLM interact with a live page.
Stateful sessions are created via the Steel Sessions API and connected over CDP (Chrome DevTools Protocol).
Available Tools
- `session_create` - Create a new Steel session and connect
- `session_release` - Release the current session
- `navigate` - Navigate to a URL
- `search` - Open Google search results for a query
- `click` - Click an element by label
- `type` - Type into an element by label
- `scroll_down` / `scroll_up` - Scroll the page
- `go_back` - Navigate back
- `wait` - Wait a few seconds for dynamic content
- `snapshot` - Annotated screenshot + labels list
- `snapshot_unmarked` - Screenshot without labels
- `page_content` - Return page HTML or text
Example Session
// Create a session
{
"tool": "session_create",
"arguments": { "timeoutMs": 900000 }
}
// Navigate
{
"tool": "navigate",
"arguments": { "url": "https://example.com" }
}
// Get an annotated snapshot (labels + image)
{
"tool": "snapshot",
"arguments": {}
}
// Click a labeled element
{
"tool": "click",
"arguments": { "label": 3 }
}
// Type into a labeled input
{
"tool": "type",
"arguments": { "label": 5, "text": "hello", "replaceText": true }
}
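The call sequence above can also be expressed as typed data, which makes it harder to forget the closing `session_release`. The sketch below is illustrative: the `ToolCall` union covers only the tools used in this example, `buildSessionPlan` is a hypothetical helper, and the empty argument object for `session_release` is an assumption (the README does not show its arguments).

```typescript
// Argument shapes follow the example session; the types are illustrative.
type ToolCall =
  | { tool: "session_create"; arguments: { timeoutMs?: number } }
  | { tool: "session_release"; arguments: Record<string, never> }
  | { tool: "navigate"; arguments: { url: string } }
  | { tool: "snapshot"; arguments: Record<string, never> }
  | { tool: "click"; arguments: { label: number } }
  | { tool: "type"; arguments: { label: number; text: string; replaceText?: boolean } };

// Steps are everything except the session lifecycle calls.
type Step = Exclude<ToolCall, { tool: "session_create" | "session_release" }>;

// Wrap interaction steps between session_create and session_release.
function buildSessionPlan(steps: Step[], timeoutMs = 900000): ToolCall[] {
  return [
    { tool: "session_create", arguments: { timeoutMs } },
    ...steps,
    { tool: "session_release", arguments: {} },
  ];
}

const plan = buildSessionPlan([
  { tool: "navigate", arguments: { url: "https://example.com" } },
  { tool: "snapshot", arguments: {} },
  { tool: "click", arguments: { label: 3 } },
]);
```

Labels such as `3` come from the list returned by `snapshot`, so a real agent would take a snapshot before each `click` or `type` rather than hard-coding them.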
Smart Length Management
The server automatically handles content length optimization:
- Unified Length Control: Single `maxLength` parameter handles both content and metadata
- Automatic Content/Metadata Split: For markdown, reserves 10% for metadata, uses 90% for content
- Smart Defaults: Reasonable defaults when no length is specified (markdown=8000, text=10000, html=15000, json=5000)
- Better Truncation: Avoids double-truncation issues that could result in incomplete content
- Conversion Detection: Automatically detects when HTML-to-markdown conversion may have failed
- Warning System: Provides warnings when content appears truncated or incomplete
How It Works
// Simple usage - uses smart defaults
{
"url": "https://example.com",
"format": ["markdown"]
// Automatically uses 8000 characters, reserves 800 for metadata, 7200 for content
}
// Custom length - automatically splits appropriately
{
"url": "https://example.com",
"format": ["markdown"],
"maxLength": 5000
// Uses 5000 total, reserves 500 for metadata, 4500 for content
}
This approach ensures you get complete, properly formatted content while maintaining simple, intuitive parameter management.
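The arithmetic behind the split can be sketched in a few lines of TypeScript. The per-format defaults are the ones listed for the `maxLength` parameter; `lengthBudget` is a hypothetical helper, and reserving nothing for non-markdown formats is an assumption, since only the markdown split is documented.

```typescript
type Format = "markdown" | "readability" | "html" | "cleaned_html";

// Defaults as listed for the maxLength parameter.
const DEFAULT_MAX_LENGTH: Record<Format, number> = {
  markdown: 8000,
  readability: 10000,
  html: 15000,
  cleaned_html: 12000,
};

interface LengthBudget {
  total: number;
  content: number;
  metadata: number;
}

// For markdown, reserve 10% of the budget for metadata and give the rest to content.
function lengthBudget(format: Format, maxLength?: number): LengthBudget {
  const total = maxLength ?? DEFAULT_MAX_LENGTH[format];
  // Assumption: formats other than markdown reserve no metadata space.
  const metadata = format === "markdown" ? Math.floor(total / 10) : 0;
  return { total, content: total - metadata, metadata };
}
```

With these rules, `lengthBudget("markdown")` yields 7200 content characters and 800 for metadata, and `lengthBudget("markdown", 5000)` yields 4500 and 500, matching the two examples above.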
Handling Large Pages (Like Amazon)
For large, complex pages like Amazon.com, follow these best practices:
Recommended Approach for Complex Pages
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://www.amazon.com",
"format": ["readability"], // Most reliable for complex pages
"maxLength": 5000, // Reasonable limit for large pages
"delay": 3 // Wait for main content to load
}
}
Format Comparison for Large Pages
- HTML: Returns raw HTML source (can be 900,000+ characters for Amazon)
- Readability: Mozilla Readability format (most reliable, good for complex pages)
- Markdown: Converts HTML to clean, readable text (may fail on complex pages like Amazon)
- Cleaned HTML: Cleaned HTML with better structure
Note: Markdown conversion may fail on complex, JavaScript-heavy pages like Amazon. Use ["readability"] for the most reliable results.
Troubleshooting
If you get HTML instead of Markdown:
- The steel-dev API may not support markdown conversion for that page type
- Try using `format: ["readability"]` instead for better text extraction
- Complex pages with heavy JavaScript may not convert properly
If you get truncated content:
- The page may be too large for the specified `maxLength`
- Try increasing `maxLength` or using a longer `delay`
- Consider using `format: ["readability"]` for more reliable truncation
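A client can automate the first of these checks. The following sketch is a heuristic only, and both `looksLikeHtml` and `nextFormat` are hypothetical helpers rather than part of the server: it looks at the start of the returned text and suggests a retry with `readability` when a markdown request appears to have come back as raw HTML.

```typescript
// Heuristic: does the start of the returned text look like an HTML document?
// Markdown may legitimately contain inline HTML, so this can give false positives.
function looksLikeHtml(content: string): boolean {
  const head = content.trimStart().slice(0, 200).toLowerCase();
  return (
    head.startsWith("<!doctype html") ||
    head.startsWith("<html") ||
    /<(head|body|div)[\s>]/.test(head)
  );
}

// Suggest a fallback format, or null when no retry seems necessary.
function nextFormat(
  requested: "markdown" | "readability",
  content: string,
): "readability" | null {
  return requested === "markdown" && looksLikeHtml(content) ? "readability" : null;
}
```

When `nextFormat` returns `"readability"`, the caller would repeat the same `visit_with_browser` call with `"format": ["readability"]`.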
For Dynamic Content
Use the `delay` parameter to wait for content to load:
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://www.amazon.com",
"format": ["markdown"],
"delay": 5, // Wait 5 seconds for content to load
"maxLength": 10000 // Longer content for complex pages
}
}
Clean Output by Default
The server is designed with 7B models in mind, providing clean, content-focused output by default:
- Content Summarization: Perfect for weaker models that need to summarize web content
- Content Analysis: Ideal for processing large amounts of text
- Context Optimization: Maximizes the content-to-metadata ratio automatically
How It Works
Default Mode (clean output):
# Article Title
This is the actual content...
Verbose Mode (verboseMode: true):
SUCCESS: Successfully scraped https://example.com
Method: full-browser-automation (stealth browser, anti-detection)
Format: markdown
Status Code: 200
Processing Time: 1250ms
Content Length: 5000 characters
Content Type: text/html
Timestamp: 2024-01-15T10:30:00.000Z
Title: Article Title
Description: Article description
Language: en
Screenshot: Available (base64)
Links Found: 15
SCRAPED CONTENT:
# Article Title
This is the actual content...
Benefits of Clean Output
- Maximum Content Space: Removes ~200-300 characters of metadata overhead
- Cleaner Output: Direct content without verbose headers
- Better for 7B Models: Focuses the model's attention on the actual content
- Preserves Warnings: Still shows important warnings if conversion issues occur
Recommended Usage
For summarization tasks, use the default clean output:
{
"tool": "visit_with_browser",
"arguments": {
"url": "https://article-to-summarize.com",
"format": ["markdown"],
"maxLength": 10000 // Automatically optimizes content vs metadata split
}
}
Steel-dev API Requirements
This MCP server expects a steel-dev API instance running with the following endpoints:
- `POST /scrape` - Main scraping endpoint
- `GET /health` - Health check endpoint (optional)
- `GET /info` - API information endpoint (optional)
- `POST /v1/sessions` - Create a stateful browser session
- `POST /v1/sessions/{id}/release` - Release a stateful session
Expected Request Format
{
"url": "https://example.com",
"format": ["html", "markdown"],
"screenshot": true,
"pdf": false,
"proxyUrl": "http://proxy:8080",
"delay": 2,
"logUrl": "https://logs.example.com"
}
Expected Response Format
{
"content": {
"html": "<html>...</html>",
"markdown": "# Title\nContent..."
},
"metadata": {
"title": "Page Title",
"description": "Page description",
"statusCode": 200,
"timestamp": "2024-01-15T10:30:00.000Z"
},
"links": [
{"url": "https://example.com/link1", "text": "Link Text"}
],
"screenshot": "base64...",
"pdf": "base64..."
}
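For callers that want compile-time and runtime checks on these payloads, the request and response above can be described with TypeScript types. This is a sketch derived from the two examples only: which fields are optional is an assumption, and `isScrapeResponse` and `pickContent` are hypothetical helpers, not part of `src/steel-api.ts`.

```typescript
// Shapes inferred from the example payloads; optionality is assumed.
interface ScrapeRequest {
  url: string;
  format?: string[];
  screenshot?: boolean;
  pdf?: boolean;
  proxyUrl?: string;
  delay?: number;
  logUrl?: string;
}

interface ScrapeResponse {
  content: Record<string, string>;
  metadata: { title?: string; description?: string; statusCode: number; timestamp?: string };
  links?: { url: string; text: string }[];
  screenshot?: string;
  pdf?: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Minimal runtime check: content is a map of strings and a numeric status code exists.
function isScrapeResponse(value: unknown): value is ScrapeResponse {
  if (!isRecord(value)) return false;
  const { content, metadata } = value;
  if (!isRecord(content) || !isRecord(metadata)) return false;
  return (
    Object.values(content).every((entry) => typeof entry === "string") &&
    typeof metadata["statusCode"] === "number"
  );
}

// Return the first requested format that is present in the response.
function pickContent(response: ScrapeResponse, preferred: string[]): string | undefined {
  for (const format of preferred) {
    const text = response.content[format];
    if (typeof text === "string") return text;
  }
  return undefined;
}

const exampleRequest: ScrapeRequest = {
  url: "https://example.com",
  format: ["html", "markdown"],
  screenshot: true,
  delay: 2,
};
```

The guard is deliberately loose: it checks only the fields the sketch relies on, so a real steel-dev response with extra fields still passes.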
Development
Project Structure
src/
├── index.ts # Main MCP server implementation
├── steel-api.ts # Steel-dev API wrapper
└── config.ts # Configuration management
Scripts
- `npm run build` - Build TypeScript to JavaScript
- `npm run start` - Run the built server
- `npm run dev` - Run in development mode with tsx
Adding New Features
- Modify the tool schema in `src/index.ts`
- Update the `SteelAPI` class in `src/steel-api.ts` if needed
- Rebuild and test
Error Handling
The server includes comprehensive error handling:
- Network errors are caught and returned as error responses
- Invalid parameters are validated
- Steel-dev API errors are properly forwarded
- Timeout handling for long-running requests
License
MIT