uspto-crawler-mcp

A specialized web crawler and scraper for the US Patent and Trademark Office (USPTO) website, built with Crawl4AI and MCP integration, enabling patent and trademark search, extraction, and status checking.

USPTO Crawler MCP - Patent & Trademark Web Scraper

Author: Yobie Benjamin
Version: 0.2
Date: July 28, 2025

Overview

A specialized web crawler and scraper for the US Patent and Trademark Office (USPTO) website, built with Crawl4AI and MCP (Model Context Protocol) integration. This tool makes it easy to search, extract, and analyze patent and trademark data from the notoriously difficult-to-navigate USPTO databases. It is a work in progress and far from perfect, and it still needs tuning, especially in the frontend interface. I would be thrilled if the open-source community contributed tweaks to improve the overall app. Thank you.

Current Status (August 2025):

  • The application uses enhanced mock data with realistic patent information for demonstration
  • PatentsView API v1 has been deprecated (returns 410 Gone)
  • Google Patents API blocks automated requests (503 errors)
  • Direct USPTO website access requires complex session handling
  • NEW: Selenium WebDriver integration for browser automation (experimental)
  • The infrastructure is fully functional and ready to integrate once usable APIs become available. In testing, the current APIs required an account and rejected scraping attempts.
  • Mock data includes realistic patents for AI, blockchain, quantum computing, and more

Features

🔍 Smart USPTO Navigation

  • Patent Search: Search PatFT (granted patents) and AppFT (applications)
  • Trademark Search: Query TESS database with advanced filters
  • Status Checking: Real-time application status via PAIR and TSDR
  • Bulk Extraction: Process multiple patents/trademarks efficiently

🤖 Crawl4AI Integration

  • AI-Powered Extraction: Uses LLM to understand complex USPTO pages
  • Smart Content Detection: Automatically identifies patent vs trademark content
  • Adaptive Crawling: Adjusts strategy based on page structure
  • Rate Limiting: Respectful crawling to avoid overwhelming USPTO servers

💻 User-Friendly Interface

  • React Frontend: Modern, responsive web interface
  • Real-time Updates: WebSocket connection for progress tracking
  • Export Options: Download results as JSON, CSV, or Excel
  • Search History: Track and replay previous searches

🔌 MCP Integration

  • Claude Desktop Compatible: Use directly from Claude via MCP
  • Standalone API: REST API for programmatic access
  • WebSocket Support: Real-time streaming of results

Installation

Prerequisites

  • Node.js 18+ and npm
  • Python 3.8+ with pip
  • Chrome/Chromium (for Crawl4AI)

Quick Start

# Clone the repository
git clone https://github.com/yobieben/uspto-crawler-mcp.git
cd uspto-crawler-mcp

# Install Node dependencies
npm install

# Install Python dependencies (Crawl4AI)
pip install crawl4ai playwright
playwright install chromium

# Install frontend dependencies
cd frontend
npm install
cd ..

# Start the application
npm run dev

The application will be available at:

  • Frontend: http://localhost:3000
  • Backend API: http://localhost:3001
  • MCP Server: Via stdio

MCP Configuration

Add to your Claude Desktop configuration:

{
  "mcpServers": {
    "uspto-crawler": {
      "command": "node",
      "args": ["/path/to/uspto-crawler-mcp/dist/mcp/index.js"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

Usage

Web Interface

  1. Search Patents/Trademarks:

    • Select search type (Patent or Trademark)
    • Enter search criteria
    • Click Search
    • View and export results
  2. Check Application Status:

    • Go to Status Check tab
    • Enter application/serial number
    • Select type (Patent/Trademark)
    • View current status
  3. Bulk Extraction:

    • Go to Bulk Extract tab
    • Upload list of numbers or URLs
    • Select extraction type
    • Download results when complete

MCP Tools

Available tools when using via Claude:

  • uspto_patent_search: Search patent databases
  • uspto_trademark_search: Search trademark databases
  • uspto_advanced_search: Combined search with multiple criteria
  • uspto_status_check: Check application status
  • uspto_bulk_extract: Extract data from multiple sources
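
When an MCP client such as Claude Desktop invokes one of these tools, it sends a JSON-RPC 2.0 `tools/call` request to the server over stdio. The sketch below shows roughly what such a message could look like. The argument names (`query`, `limit`) are an assumption that mirrors the REST request bodies in this README; the tools' actual input schemas are not documented here.

```typescript
// Shape of an MCP "tools/call" request (JSON-RPC 2.0 envelope).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Tool names as listed in this README.
type UsptoTool =
  | "uspto_patent_search"
  | "uspto_trademark_search"
  | "uspto_advanced_search"
  | "uspto_status_check"
  | "uspto_bulk_extract";

// Build a request message. Argument names are assumptions, not a documented schema.
function buildToolCall(
  id: number,
  name: UsptoTool,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const exampleCall = buildToolCall(1, "uspto_patent_search", {
  query: "quantum computing",
  limit: 5,
});
console.log(JSON.stringify(exampleCall, null, 2));
```

In normal use the MCP client builds this message itself; the sketch is mainly useful when debugging the server by hand.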

API Endpoints

# Search patents
POST /api/patents/search
{
  "query": "artificial intelligence",
  "inventor": "John Doe",
  "dateFrom": "2020-01-01",
  "dateTo": "2025-01-01",
  "limit": 20
}

# Search trademarks
POST /api/trademarks/search
{
  "query": "NIKE",
  "owner": "Nike Inc",
  "status": "live",
  "limit": 20
}

# Check status
GET /api/status/patent/16123456
GET /api/status/trademark/88123456

# Bulk extraction
POST /api/extract/bulk
{
  "numbers": ["16123456", "16234567"],
  "extractType": "full",
  "format": "json"
}
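
A minimal client sketch for the patent search endpoint above, assuming the backend runs at the Quick Start address and accepts exactly the fields shown. The response shape is not documented in this README, so it is left as `unknown`. The validation rules are illustrative, not the server's own.

```typescript
// Request body for POST /api/patents/search, using the fields shown above.
interface PatentSearchRequest {
  query: string;
  inventor?: string;
  dateFrom?: string; // YYYY-MM-DD
  dateTo?: string; // YYYY-MM-DD
  limit?: number;
}

const API_BASE = "http://localhost:3001"; // backend address from Quick Start

// Catch obviously invalid input before it reaches the server (illustrative rules).
function validateSearch(req: PatentSearchRequest): string[] {
  const errors: string[] = [];
  if (req.query.trim() === "") errors.push("query must not be empty");
  // ISO dates (YYYY-MM-DD) compare correctly as plain strings.
  if (req.dateFrom && req.dateTo && req.dateFrom > req.dateTo) {
    errors.push("dateFrom must not be after dateTo");
  }
  if (req.limit !== undefined && (!Number.isInteger(req.limit) || req.limit < 1)) {
    errors.push("limit must be a positive integer");
  }
  return errors;
}

// POST the search using the global fetch available in Node.js 18+.
async function searchPatents(req: PatentSearchRequest): Promise<unknown> {
  const errors = validateSearch(req);
  if (errors.length > 0) throw new Error(errors.join("; "));
  const res = await fetch(`${API_BASE}/api/patents/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`search failed with HTTP ${res.status}`);
  return res.json();
}
```

With the server running, `searchPatents({ query: "artificial intelligence", limit: 5 })` would send the same request as the first example above.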

USPTO Databases Supported

Patent Databases

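  • PatFT: Full-Text Database (granted patents from 1976); retired by USPTO in 2022 in favor of Patent Public Search
  • AppFT: Published Applications (from 2001); also retired in 2022
  • Patent Public Search: Unified search system that replaced PatFT and AppFT
  • PAIR: Patent Application Information Retrieval (Public PAIR was retired in 2022; Patent Center is its successor)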

Trademark Databases

  • TESS: Trademark Electronic Search System (retired by USPTO in late 2023 and replaced by a new Trademark Search system)
  • TSDR: Trademark Status & Document Retrieval
  • ID Manual: Acceptable Identification of Goods/Services

Advanced Features

Selenium WebDriver Integration (NEW)

The application now includes experimental Selenium WebDriver support for browser automation:

Features:

  • Real Browser Automation: Uses actual Chrome/Firefox browsers to bypass bot detection
  • Anti-Detection Measures: Implements various techniques to avoid detection
    • Disables automation flags
    • Randomized delays to mimic human behavior
    • Natural scrolling patterns
    • Custom user agent strings
  • Fallback System: Automatically falls back to mock data if scraping fails
  • Headless Mode: Can run with or without visible browser window
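
The randomized-delay and fallback behavior listed above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `randomDelayMs`, `withMockFallback`, and the `source` tag are hypothetical names.

```typescript
// Pick a delay in [minMs, maxMs) so requests do not arrive at a fixed cadence.
// `rand` is injectable so the function can be tested deterministically.
function randomDelayMs(
  minMs: number,
  maxMs: number,
  rand: () => number = Math.random,
): number {
  return Math.floor(minMs + rand() * (maxMs - minMs));
}

// A result tagged with where it came from, so callers can tell mock from live data.
interface Sourced<T> {
  data: T;
  source: "live" | "mock";
}

// Try the live scraper first; on any error, return mock data and say so.
async function withMockFallback<T>(
  scrape: () => Promise<T>,
  mock: () => T,
): Promise<Sourced<T>> {
  try {
    return { data: await scrape(), source: "live" };
  } catch {
    return { data: mock(), source: "mock" };
  }
}
```

Tagging the result with its source keeps mock data from being mistaken for real USPTO records further down the pipeline.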

Setup:

# Install ChromeDriver
npm install -g chromedriver

# Enable Selenium in your environment
export USE_SELENIUM=true
export SELENIUM_HEADLESS=false  # Set to true for headless mode

# Run the application
npm run dev

Testing Selenium:

# Test Selenium functionality
npx tsx test-selenium.ts

# Test with API (start the server with USE_SELENIUM=true first;
# the variable is read by the server, not by curl)
curl -X POST http://localhost:3001/api/patents/search \
  -H "Content-Type: application/json" \
  -d '{"query": "artificial intelligence", "limit": 5}'

Current Limitations:

  • USPTO websites have strong anti-bot measures
  • Google Patents blocks automated requests after detection
  • May require proxy rotation for production use
  • Performance is slower than API-based approaches

Smart Crawling Strategies

The crawler automatically adapts to different USPTO page types:

  1. Patent Pages:

    • Extracts patent number, title, abstract
    • Captures inventors, assignees, claims
    • Downloads PDFs when available
  2. Trademark Pages:

    • Extracts mark text, owner information
    • Captures goods/services descriptions
    • Downloads mark images
  3. Search Results:

    • Paginates through results automatically
    • Maintains session for complex searches
    • Handles CAPTCHA detection
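
The search-results strategy above can be sketched as a pagination loop. The `fetchPage` callback and the `ResultPage` shape are hypothetical stand-ins for whatever the crawler does internally. "Handling" a CAPTCHA here only means stopping and reporting it, not solving it.

```typescript
// One page of search results, as a hypothetical fetcher might return it.
interface ResultPage<T> {
  items: T[];
  nextPage: number | null; // null when there are no more pages
  captcha: boolean; // true if the page was a CAPTCHA challenge
}

interface CrawlOutcome<T> {
  items: T[];
  stoppedByCaptcha: boolean;
}

// Follow nextPage links until they run out, a CAPTCHA appears, or maxPages is hit.
async function collectAllPages<T>(
  fetchPage: (page: number) => Promise<ResultPage<T>>,
  maxPages = 50,
): Promise<CrawlOutcome<T>> {
  const items: T[] = [];
  let page: number | null = 1;
  let fetched = 0;
  while (page !== null && fetched < maxPages) {
    const result: ResultPage<T> = await fetchPage(page);
    if (result.captcha) return { items, stoppedByCaptcha: true };
    items.push(...result.items);
    page = result.nextPage;
    fetched++;
  }
  return { items, stoppedByCaptcha: false };
}
```

The `maxPages` cap guards against a result set whose "next" link never ends.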

Data Extraction Options

  • Summary: Basic information only
  • Full: Complete document with all sections
  • Status: Current status and prosecution history
  • Custom: Specify exact fields to extract
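
One way to think about these options is as a field projection over an extracted record. The sketch below assumes illustrative field names (`number`, `title`, `abstract`, `status`, `prosecutionHistory`); the fields the tool actually returns for each option are not specified in this README.

```typescript
type ExtractType = "summary" | "full" | "status" | "custom";

// Assumed field sets for the fixed options (illustrative only).
const SUMMARY_FIELDS = ["number", "title", "abstract"];
const STATUS_FIELDS = ["number", "status", "prosecutionHistory"];

// Return only the fields that the chosen extraction option asks for.
function selectFields(
  record: Record<string, unknown>,
  type: ExtractType,
  customFields: string[] = [],
): Record<string, unknown> {
  if (type === "full") return { ...record };
  const wanted =
    type === "summary" ? SUMMARY_FIELDS : type === "status" ? STATUS_FIELDS : customFields;
  const out: Record<string, unknown> = {};
  for (const field of wanted) {
    // Fields missing from the record are skipped rather than filled with blanks.
    if (field in record) out[field] = record[field];
  }
  return out;
}
```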

Export Formats

  • JSON: Structured data with all fields
  • CSV: Tabular format for spreadsheets
  • Excel: Formatted workbook with multiple sheets
  • PDF: Formatted reports (coming soon)
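
For the CSV format, the main pitfall is quoting: patent titles and abstracts often contain commas and quotation marks. The following is a minimal sketch of RFC 4180 style quoting, not the project's exporter; the column names in the usage note are illustrative.

```typescript
// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes.
function csvField(value: unknown): string {
  const s = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Serialize rows to CSV with a header line, using the given column order.
function toCsv(rows: Record<string, unknown>[], columns: string[]): string {
  const lines = [columns.map(csvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((c) => csvField(row[c])).join(","));
  }
  return lines.join("\r\n");
}
```

For example, `toCsv(results, ["number", "title"])` would produce a two-column file that spreadsheets open without splitting titles at embedded commas.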

Configuration

Environment Variables

# Server Configuration
PORT=3001
LOG_LEVEL=info

# Selenium Configuration (NEW)
USE_SELENIUM=true              # Enable Selenium-based scraping
SELENIUM_HEADLESS=false        # Set to true to run the browser without a visible window
SELENIUM_BROWSER=chrome        # Browser to use (chrome/firefox)

# Crawl4AI Configuration
CRAWL4AI_HEADLESS=true
CRAWL4AI_TIMEOUT=30000
CRAWL4AI_USER_AGENT="USPTO-Crawler/0.2"

# AI Configuration (for LLM extraction)
OPENAI_API_KEY=your_api_key  # Optional
LLM_PROVIDER=openai
LLM_MODEL=gpt-4

# Rate Limiting
MAX_CONCURRENT_CRAWLS=5
DELAY_BETWEEN_REQUESTS=1000
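
A sketch of how the two rate-limiting settings could be applied. It assumes `DELAY_BETWEEN_REQUESTS` is in milliseconds and is a pause each worker takes between its own tasks; neither detail is confirmed by this README. `parseRateLimit` and `runLimited` are hypothetical helpers.

```typescript
interface RateLimitConfig {
  maxConcurrent: number;
  delayMs: number; // assumed unit: milliseconds
}

// Read settings from an env-like object (e.g. process.env), falling back to
// the defaults shown above when a value is missing or not a positive integer.
function parseRateLimit(env: Record<string, string | undefined>): RateLimitConfig {
  const toInt = (value: string | undefined, fallback: number): number => {
    const n = Number.parseInt(value ?? "", 10);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };
  return {
    maxConcurrent: toInt(env["MAX_CONCURRENT_CRAWLS"], 5),
    delayMs: toInt(env["DELAY_BETWEEN_REQUESTS"], 1000),
  };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run tasks with at most cfg.maxConcurrent in flight. Results keep task order.
async function runLimited<T>(
  tasks: Array<() => Promise<T>>,
  cfg: RateLimitConfig,
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
      // Pause before this worker picks up another task.
      if (next < tasks.length) await sleep(cfg.delayMs);
    }
  };
  const workerCount = Math.min(cfg.maxConcurrent, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```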

Custom Extraction Rules

Create custom extraction rules in config/extraction-rules.json:

{
  "patent": {
    "selectors": {
      "title": "h1.patent-title",
      "abstract": "div.abstract",
      "claims": "div.claims"
    }
  },
  "trademark": {
    "selectors": {
      "mark": "div.mark-text",
      "owner": "span.owner-name",
      "status": "div.status-container"
    }
  }
}
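
A rules file like the one above is easy to break with a typo, so it may help to validate it on load. This sketch assumes the file has only the two top-level sections shown (`patent`, `trademark`), each with a `selectors` map of field name to CSS selector. `parseExtractionRules` is a hypothetical helper, not part of the project.

```typescript
interface SelectorRules {
  selectors: Record<string, string>; // field name -> CSS selector
}

type ExtractionRules = Partial<Record<"patent" | "trademark", SelectorRules>>;

// Parse the JSON text of a rules file and check its structure.
// Throws an Error naming the first problem found.
function parseExtractionRules(json: string): ExtractionRules {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("rules must be a JSON object");
  }
  const rules: ExtractionRules = {};
  for (const kind of ["patent", "trademark"] as const) {
    const section = (raw as Record<string, unknown>)[kind];
    if (section === undefined) continue; // a section may be omitted
    const selectors = (section as { selectors?: unknown } | null)?.selectors;
    if (typeof selectors !== "object" || selectors === null) {
      throw new Error(`${kind}.selectors must be an object`);
    }
    const checked: Record<string, string> = {};
    for (const [field, css] of Object.entries(selectors as Record<string, unknown>)) {
      if (typeof css !== "string" || css.trim() === "") {
        throw new Error(`${kind}.selectors.${field} must be a non-empty string`);
      }
      checked[field] = css;
    }
    rules[kind] = { selectors: checked };
  }
  return rules;
}
```

The function takes the file contents as a string, so reading `config/extraction-rules.json` from disk is left to the caller.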

Troubleshooting

Common Issues

Crawl4AI Installation Failed

# Install with specific version
pip install crawl4ai==0.3.0

# Install Playwright browsers
playwright install chromium

USPTO Rate Limiting

  • The crawler includes automatic rate limiting
  • If blocked, wait 15 minutes before retrying
  • Consider using proxy rotation for large-scale extraction

No Results Found

  • USPTO search syntax is specific
  • Try simpler queries first
  • Check that date ranges are valid
  • Verify classification codes

Debug Mode

Enable debug logging:

LOG_LEVEL=debug npm run dev

View Crawl4AI logs:

tail -f logs/crawl4ai.log

Performance

Optimization Tips

  1. Batch Processing: Use bulk extraction for multiple items
  2. Caching: Results are cached for 24 hours
  3. Parallel Crawling: Up to 5 concurrent crawls by default
  4. Smart Routing: AI determines optimal extraction strategy
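
The 24-hour caching mentioned above can be pictured as a small time-to-live cache keyed by the search request. This is a sketch of the general technique, not the project's cache; `TtlCache` and the key format are hypothetical.

```typescript
// In-memory cache whose entries expire a fixed time after they are stored.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns undefined for unknown keys and for expired entries, which are evicted.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const searchCache = new TtlCache<unknown>(DAY_MS);

// Hypothetical key: the serialized search request.
searchCache.set(JSON.stringify({ query: "blockchain", limit: 20 }), { results: [] });
```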

Benchmarks

  • Single patent search: ~2-3 seconds
  • Bulk extraction (100 items): ~5 minutes
  • Full patent download: ~5 seconds
  • Status check: ~1 second

Legal Notice

This tool is for educational and research purposes. Please:

  • Respect USPTO's terms of service
  • Use reasonable rate limiting
  • Don't overwhelm USPTO servers
  • Cite USPTO as the data source

Contributing

Contributions are welcome! Areas for improvement:

  • Additional extraction strategies
  • Support for more USPTO databases
  • Enhanced AI extraction rules
  • Performance optimizations
  • UI/UX improvements

License

MIT License - See LICENSE file

Acknowledgments

  • Crawl4AI for the amazing crawling framework
  • USPTO for providing public access to patent and trademark data
  • Anthropic for the Model Context Protocol (MCP)
  • The open-source community

Built with ❤️ by Yobie Benjamin
Making USPTO data accessible to everyone
