uspto-crawler-mcp
USPTO Crawler MCP - Patent & Trademark Web Scraper
Author: Yobie Benjamin
Version: 0.2
Date: July 28, 2025
Overview
A specialized web crawler and scraper for the US Patent and Trademark Office (USPTO) website, built with Crawl4AI and MCP (Model Context Protocol) integration. This tool makes it easier to search, extract, and analyze patent and trademark data from the notoriously difficult-to-navigate USPTO databases. It is a work in progress and far from perfect; it still needs tuning, especially in the frontend interface. I would be thrilled if the open-source community contributed tweaks to improve the overall app. Thank you.
Current Status (August 2025):
- The application uses enhanced mock data with realistic patent information for demonstration
- PatentsView API v1 has been deprecated (returns 410 Gone)
- Google Patents API blocks automated requests (503 errors)
- Direct USPTO website access requires complex session handling
- NEW: Selenium WebDriver integration for browser automation (experimental)
- The infrastructure is fully functional and ready for integration when a usable API becomes available. In experiments, the current APIs required an account and rejected scraping attempts.
- Mock data includes realistic patents for AI, blockchain, quantum computing, and more
Features
🔍 Smart USPTO Navigation
- Patent Search: Search PatFT (granted patents) and AppFT (applications)
- Trademark Search: Query TESS database with advanced filters
- Status Checking: Real-time application status via PAIR and TSDR
- Bulk Extraction: Process multiple patents/trademarks efficiently
🤖 Crawl4AI Integration
- AI-Powered Extraction: Uses LLM to understand complex USPTO pages
- Smart Content Detection: Automatically identifies patent vs trademark content
- Adaptive Crawling: Adjusts strategy based on page structure
- Rate Limiting: Respectful crawling to avoid overwhelming USPTO servers
💻 User-Friendly Interface
- React Frontend: Modern, responsive web interface
- Real-time Updates: WebSocket connection for progress tracking
- Export Options: Download results as JSON, CSV, or Excel
- Search History: Track and replay previous searches
🔌 MCP Integration
- Claude Desktop Compatible: Use directly from Claude via MCP
- Standalone API: REST API for programmatic access
- WebSocket Support: Real-time streaming of results
Installation
Prerequisites
- Node.js 18+ and npm
- Python 3.8+ with pip
- Chrome/Chromium (for Crawl4AI)
Quick Start
# Clone the repository
git clone https://github.com/yobieben/uspto-crawler-mcp.git
cd uspto-crawler-mcp
# Install Node dependencies
npm install
# Install Python dependencies (Crawl4AI)
pip install crawl4ai playwright
playwright install chromium
# Install frontend dependencies
cd frontend
npm install
cd ..
# Start the application
npm run dev
The application will be available at:
- Frontend: http://localhost:3000
- Backend API: http://localhost:3001
- MCP Server: Via stdio
MCP Configuration
Add to your Claude Desktop configuration:
{
"mcpServers": {
"uspto-crawler": {
"command": "node",
"args": ["/path/to/uspto-crawler-mcp/dist/mcp/index.js"],
"env": {
"LOG_LEVEL": "info"
}
}
}
}
Usage
Web Interface
- Search Patents/Trademarks:
  - Select search type (Patent or Trademark)
  - Enter search criteria
  - Click Search
  - View and export results
- Check Application Status:
  - Go to Status Check tab
  - Enter application/serial number
  - Select type (Patent/Trademark)
  - View current status
- Bulk Extraction:
  - Go to Bulk Extract tab
  - Upload list of numbers or URLs
  - Select extraction type
  - Download results when complete
MCP Tools
Available tools when using via Claude:
- uspto_patent_search: Search patent databases
- uspto_trademark_search: Search trademark databases
- uspto_advanced_search: Combined search with multiple criteria
- uspto_status_check: Check application status
- uspto_bulk_extract: Extract data from multiple sources
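MCP clients invoke a tool by sending a JSON-RPC 2.0 `tools/call` request that names the tool and passes its arguments. The sketch below only builds that message for one of the tools listed above. The argument keys (`query`, `limit`) are assumptions borrowed from the REST examples in this README, not a confirmed tool schema, and `buildToolCall` is a hypothetical helper.

```typescript
// Names of the tools listed above.
type UsptoTool =
  | "uspto_patent_search"
  | "uspto_trademark_search"
  | "uspto_advanced_search"
  | "uspto_status_check"
  | "uspto_bulk_extract";

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: UsptoTool; arguments: Record<string, unknown> };
}

// Build the JSON-RPC message an MCP client would send to invoke a tool.
// The argument keys are assumptions; check the server's published tool schema.
function buildToolCall(
  id: number,
  name: UsptoTool,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const request = buildToolCall(1, "uspto_patent_search", {
  query: "artificial intelligence",
  limit: 5,
});
console.log(JSON.stringify(request, null, 2));
```

In Claude Desktop this message is produced for you; building it by hand is mainly useful when testing the server over stdio.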
API Endpoints
# Search patents
POST /api/patents/search
{
"query": "artificial intelligence",
"inventor": "John Doe",
"dateFrom": "2020-01-01",
"dateTo": "2025-01-01",
"limit": 20
}
# Search trademarks
POST /api/trademarks/search
{
"query": "NIKE",
"owner": "Nike Inc",
"status": "live",
"limit": 20
}
# Check status
GET /api/status/patent/16123456
GET /api/status/trademark/88123456
# Bulk extraction
POST /api/extract/bulk
{
"numbers": ["16123456", "16234567"],
"extractType": "full",
"format": "json"
}
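A minimal TypeScript client for the patent search endpoint might look like the sketch below. It assumes the backend is running at the Quick Start address (`http://localhost:3001`). The response schema is not documented in this README, so the result is typed as `unknown`, and `patentSearchRequest` and `searchPatents` are hypothetical helper names.

```typescript
// Request body fields taken from the patent search example above.
interface PatentSearchBody {
  query: string;
  inventor?: string;
  dateFrom?: string; // YYYY-MM-DD
  dateTo?: string; // YYYY-MM-DD
  limit?: number;
}

interface JsonPost {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Assumed base URL: the backend address from Quick Start.
const API_BASE = "http://localhost:3001";

// Build the URL and fetch() options for POST /api/patents/search.
function patentSearchRequest(body: PatentSearchBody): { url: string; init: JsonPost } {
  return {
    url: `${API_BASE}/api/patents/search`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Send the request. The response shape is undocumented, so it is returned
// as `unknown` for the caller to inspect.
async function searchPatents(body: PatentSearchBody): Promise<unknown> {
  const { url, init } = patentSearchRequest(body);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Search failed: HTTP ${res.status}`);
  return res.json();
}
```

With the server running, `searchPatents({ query: "artificial intelligence", limit: 5 }).then(console.log)` prints whatever the backend returns, which per the Current Status notes may be mock data.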
USPTO Databases Supported
Patent Databases
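- PatFT: Full-Text Database (granted patents from 1976; retired by the USPTO in 2022 in favor of Patent Public Search)
- AppFT: Published Applications (from 2001; retired by the USPTO in 2022 in favor of Patent Public Search)
- Patent Public Search: New unified search system
- PAIR: Patent Application Information Retrieval (Public PAIR was retired in 2022 in favor of Patent Center)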
Trademark Databases
- TESS: Trademark Electronic Search System (retired by the USPTO in late 2023 and replaced by its newer trademark search system)
- TSDR: Trademark Status & Document Retrieval
- ID Manual: Acceptable Identification of Goods/Services
Advanced Features
Selenium WebDriver Integration (NEW)
The application now includes experimental Selenium WebDriver support for browser automation:
Features:
- Real Browser Automation: Uses actual Chrome/Firefox browsers to bypass bot detection
- Anti-Detection Measures: Implements various techniques to avoid detection
  - Disables automation flags
  - Randomized delays to mimic human behavior
  - Natural scrolling patterns
  - Custom user agent strings
- Fallback System: Automatically falls back to mock data if scraping fails
- Headless Mode: Can run with or without visible browser window
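Two of these behaviors, randomized delays and the mock-data fallback, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual implementation: `scrape` and `mock` are hypothetical callbacks supplied by the caller, and the default delay range is arbitrary.

```typescript
// Pause for a random duration in [minMs, maxMs] so requests are not evenly spaced.
function randomDelay(minMs: number, maxMs: number): Promise<void> {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

interface SourcedResult<T> {
  data: T;
  source: "live" | "mock";
}

// Try the live scraper; if it throws for any reason, return mock data instead.
// `scrape` and `mock` are hypothetical callbacks, not functions from this project.
async function withMockFallback<T>(
  scrape: () => Promise<T>,
  mock: () => T,
  delayRangeMs: [number, number] = [500, 1500],
): Promise<SourcedResult<T>> {
  try {
    await randomDelay(delayRangeMs[0], delayRangeMs[1]);
    return { data: await scrape(), source: "live" };
  } catch {
    return { data: mock(), source: "mock" };
  }
}
```

Tagging each result with its `source` lets a caller or UI show when it is displaying mock data rather than live results.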
Setup:
# Install ChromeDriver
npm install -g chromedriver
# Enable Selenium in your environment
export USE_SELENIUM=true
export SELENIUM_HEADLESS=false # Set to true for headless mode
# Run the application
npm run dev
Testing Selenium:
# Test Selenium functionality
npx tsx test-selenium.ts
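# Test with API (the server must already be running with USE_SELENIUM=true;
# setting the variable on the curl command itself has no effect on the server)
curl -X POST http://localhost:3001/api/patents/search \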
-H "Content-Type: application/json" \
-d '{"query": "artificial intelligence", "limit": 5}'
Current Limitations:
- USPTO websites have strong anti-bot measures
- Google Patents blocks automated requests after detection
- May require proxy rotation for production use
- Performance is slower than API-based approaches
Smart Crawling Strategies
The crawler automatically adapts to different USPTO page types:
- Patent Pages:
  - Extracts patent number, title, abstract
  - Captures inventors, assignees, claims
  - Downloads PDFs when available
- Trademark Pages:
  - Extracts mark text, owner information
  - Captures goods/services descriptions
  - Downloads mark images
- Search Results:
  - Paginates through results automatically
  - Maintains session for complex searches
  - Handles CAPTCHA detection
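Automatic pagination usually reduces to a loop that keeps requesting pages until the source reports there are no more. A rough sketch, assuming a hypothetical `fetchPage` callback that returns one page of items plus a `hasNext` flag (neither name comes from this project):

```typescript
interface Page<T> {
  items: T[];
  hasNext: boolean;
}

// Collect items from consecutive result pages, starting at page 1.
// `maxPages` is a safety cap so a misbehaving source cannot loop forever.
async function collectAllPages<T>(
  fetchPage: (pageNumber: number) => Promise<Page<T>>,
  maxPages = 50,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const { items, hasNext } = await fetchPage(page);
    all.push(...items);
    if (!hasNext) break;
  }
  return all;
}
```

Pages are fetched sequentially on purpose: parallel page requests would work against the rate limiting described elsewhere in this README.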
Data Extraction Options
- Summary: Basic information only
- Full: Complete document with all sections
- Status: Current status and prosecution history
- Custom: Specify exact fields to extract
Export Formats
- JSON: Structured data with all fields
- CSV: Tabular format for spreadsheets
- Excel: Formatted workbook with multiple sheets
- PDF: Formatted reports (coming soon)
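For the CSV format, the main pitfall is quoting: patent titles often contain commas and quotation marks. The sketch below shows one common way to flatten records into CSV text. It is not the project's exporter, `toCsv` and `csvEscape` are hypothetical names, and the sample row is placeholder data.

```typescript
// Quote a value for CSV: wrap it in double quotes when it contains a comma,
// quote, or line break, and double any embedded quotes.
function csvEscape(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows using an explicit column order; missing fields become empty cells.
function toCsv(rows: Record<string, unknown>[], columns: string[]): string {
  const header = columns.map(csvEscape).join(",");
  const lines = rows.map((row) => columns.map((c) => csvEscape(row[c])).join(","));
  return [header, ...lines].join("\r\n");
}

// Placeholder row, not a real patent.
const csv = toCsv(
  [{ number: "US0000000", title: 'Placeholder, with "quotes"' }],
  ["number", "title"],
);
console.log(csv);
```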
Configuration
Environment Variables
# Server Configuration
PORT=3001
LOG_LEVEL=info
# Selenium Configuration (NEW)
USE_SELENIUM=true # Enable Selenium-based scraping
SELENIUM_HEADLESS=false # Set to true to run the browser in headless mode
SELENIUM_BROWSER=chrome # Browser to use (chrome/firefox)
# Crawl4AI Configuration
CRAWL4AI_HEADLESS=true
CRAWL4AI_TIMEOUT=30000
CRAWL4AI_USER_AGENT="USPTO-Crawler/0.2"
# AI Configuration (for LLM extraction)
OPENAI_API_KEY=your_api_key # Optional
LLM_PROVIDER=openai
LLM_MODEL=gpt-4
# Rate Limiting
MAX_CONCURRENT_CRAWLS=5
DELAY_BETWEEN_REQUESTS=1000
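To make the two rate-limiting variables concrete, here is a sketch of how such settings could drive a small worker pool. It assumes `DELAY_BETWEEN_REQUESTS` is in milliseconds and that each worker pauses after finishing a task. How the application actually consumes these values is not documented here, and `readLimits` and `runLimited` are hypothetical helpers.

```typescript
interface CrawlLimits {
  maxConcurrent: number;
  delayMs: number;
}

// Parse limits from an environment-like object, falling back to the values
// shown above when a variable is missing or not a non-negative integer.
function readLimits(env: Record<string, string | undefined>): CrawlLimits {
  const toInt = (raw: string | undefined, fallback: number): number => {
    const n = Number.parseInt(raw ?? "", 10);
    return Number.isFinite(n) && n >= 0 ? n : fallback;
  };
  return {
    maxConcurrent: Math.max(1, toInt(env["MAX_CONCURRENT_CRAWLS"], 5)),
    delayMs: toInt(env["DELAY_BETWEEN_REQUESTS"], 1000),
  };
}

// Run tasks with at most `maxConcurrent` in flight. Each worker pauses for
// `delayMs` after a task before taking the next one. Results keep input order.
async function runLimited<T>(
  tasks: Array<() => Promise<T>>,
  limits: CrawlLimits,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
      await new Promise((resolve) => setTimeout(resolve, limits.delayMs));
    }
  };
  const workerCount = Math.min(limits.maxConcurrent, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In a Node process the limits would typically come from `readLimits(process.env)`.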
Custom Extraction Rules
Create custom extraction rules in config/extraction-rules.json:
{
"patent": {
"selectors": {
"title": "h1.patent-title",
"abstract": "div.abstract",
"claims": "div.claims"
}
},
"trademark": {
"selectors": {
"mark": "div.mark-text",
"owner": "span.owner-name",
"status": "div.status-container"
}
}
}
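One way to read such a rule set is as a map from output field to CSS selector, applied to a parsed page. The sketch below illustrates that idea only. `queryText` is a hypothetical callback standing in for whatever DOM or HTML parser performs the lookup, and the fake page is placeholder data.

```typescript
// Field name -> CSS selector, matching the "selectors" objects above.
type SelectorMap = Record<string, string>;

// Apply one rule set. `queryText` returns the text matched by a selector,
// or null when nothing matches (hypothetical; supply your own parser).
function applySelectors(
  selectors: SelectorMap,
  queryText: (selector: string) => string | null,
): Record<string, string | null> {
  const out: Record<string, string | null> = {};
  for (const [field, selector] of Object.entries(selectors)) {
    const text = queryText(selector);
    out[field] = text === null ? null : text.trim();
  }
  return out;
}

const patentSelectors: SelectorMap = {
  title: "h1.patent-title",
  abstract: "div.abstract",
};

// Stand-in for a parsed page: selector -> matched text.
const fakePage: Record<string, string> = { "h1.patent-title": "  Example title " };
const fields = applySelectors(patentSelectors, (s) => fakePage[s] ?? null);
console.log(fields);
```

Returning `null` for unmatched selectors, rather than throwing, makes it easier to notice when the USPTO changes a page layout and a rule silently stops matching.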
Troubleshooting
Common Issues
Crawl4AI Installation Failed
# Install with specific version
pip install crawl4ai==0.3.0
# Install Playwright browsers
playwright install chromium
USPTO Rate Limiting
- The crawler includes automatic rate limiting
- If blocked, wait 15 minutes before retrying
- Consider using proxy rotation for large-scale extraction
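The "wait, then retry" advice can be automated with a small wrapper. This is a sketch under assumptions: `attempt` and `isBlocked` are hypothetical callbacks, and how a block is detected (for example an HTTP 429 or a CAPTCHA page) depends on the scraper in use.

```typescript
// Retry `attempt` after a cooldown when the failure looks like a block.
// The default cooldown is the 15 minutes suggested above.
async function retryAfterCooldown<T>(
  attempt: () => Promise<T>,
  isBlocked: (err: unknown) => boolean,
  cooldownMs = 15 * 60 * 1000,
  maxRetries = 2,
): Promise<T> {
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (err) {
      // Give up on non-block errors or once the retry budget is spent.
      if (!isBlocked(err) || tries >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, cooldownMs));
    }
  }
}
```

Errors that are not blocks are rethrown immediately, so a malformed query fails fast instead of waiting out the cooldown.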
No Results Found
- USPTO search syntax is specific
- Try simpler queries first
- Check date ranges are valid
- Verify classification codes
Debug Mode
Enable debug logging:
LOG_LEVEL=debug npm run dev
View Crawl4AI logs:
tail -f logs/crawl4ai.log
Performance
Optimization Tips
- Batch Processing: Use bulk extraction for multiple items
- Caching: Results are cached for 24 hours
- Parallel Crawling: Up to 5 concurrent crawls by default
- Smart Routing: AI determines optimal extraction strategy
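The 24-hour caching behavior can be pictured as a map whose entries carry an expiry time. The class below is an in-memory sketch of that idea, not the project's cache; `TtlCache` is a hypothetical name, and the injectable clock exists only to make expiry easy to test.

```typescript
// Minimal time-to-live cache. Entries expire `ttlMs` after they are set.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }
}
```

A search handler could key the cache on the serialized request body, for example `cache.get(JSON.stringify(body))`, and only crawl on a miss.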
Benchmarks
- Single patent search: ~2-3 seconds
- Bulk extraction (100 items): ~5 minutes
- Full patent download: ~5 seconds
- Status check: ~1 second
Legal Notice
This tool is for educational and research purposes. Please:
- Respect USPTO's terms of service
- Use reasonable rate limiting
- Don't overwhelm USPTO servers
- Cite USPTO as the data source
Contributing
Contributions are welcome! Areas for improvement:
- Additional extraction strategies
- Support for more USPTO databases
- Enhanced AI extraction rules
- Performance optimizations
- UI/UX improvements
License
MIT License - See LICENSE file
Support
- Issues: GitHub Issues
- Documentation: Wiki
- Email: yobie.benjamin@example.com
Acknowledgments
- Crawl4AI for the amazing crawling framework
- USPTO for providing public access to patent and trademark data
- Anthropic for the MCP protocol
- The open-source community
Built with ❤️ by Yobie Benjamin
Making USPTO data accessible to everyone