
# Job URL Analyzer MCP Server

A comprehensive FastAPI-based microservice for analyzing job URLs and extracting detailed company information. Built with modern async Python, this service crawls job postings and company websites to build rich company profiles with data enrichment from external providers.
## ✨ Features

- 🕷️ **Intelligent Web Crawling**: Respectful crawling with robots.txt compliance and rate limiting
- 🧠 **Content Extraction**: Advanced HTML parsing using Selectolax for fast, accurate data extraction
- 🔗 **Data Enrichment**: Pluggable enrichment providers (Crunchbase, LinkedIn, custom APIs)
- 📊 **Quality Scoring**: Completeness and confidence metrics for extracted data
- 📝 **Markdown Reports**: Beautiful, comprehensive company analysis reports
- 🔍 **Observability**: OpenTelemetry tracing, Prometheus metrics, structured logging
- 🚀 **Production Ready**: Docker, Kubernetes, health checks, graceful shutdown
- 🧪 **Well Tested**: Comprehensive test suite with 80%+ coverage
## 🏗️ Architecture

```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   FastAPI App   │─────▶│  Orchestrator   │─────▶│   Web Crawler   │
└─────────────────┘      └─────────────────┘      └─────────────────┘
                                  │                        │
                                  ▼                        ▼
                         ┌─────────────────┐      ┌─────────────────┐
                         │ Content Extract │      │    Database     │
                         └─────────────────┘      │  (SQLAlchemy)   │
                                  │               └─────────────────┘
                                  ▼
                         ┌─────────────────┐      ┌─────────────────┐
                         │   Enrichment    │─────▶│    Providers    │
                         │     Manager     │      │ (Crunchbase,etc)│
                         └─────────────────┘      └─────────────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ Report Generator│
                         └─────────────────┘
```
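The Enrichment Manager treats providers as plug-ins. As a rough sketch of what a custom provider could look like (the class shape, method name, and profile fields below are assumptions for illustration, not the project's actual interface; see `src/job_url_analyzer/enricher/` for the real one):

```python
# Hypothetical sketch of a custom enrichment provider. The base shape,
# method signature, and profile fields are illustrative assumptions.
from dataclasses import dataclass, field

import httpx


@dataclass
class CompanyProfile:
    name: str
    industry: str | None = None
    employee_count: int | None = None
    sources: list[str] = field(default_factory=list)


class MyApiProvider:
    """Enriches a profile from a custom HTTP API (endpoint is hypothetical)."""

    name = "my_api"

    async def enrich(self, profile: CompanyProfile) -> CompanyProfile:
        async with httpx.AsyncClient(timeout=10) as client:
            resp = await client.get(
                "https://api.example.com/companies",  # hypothetical endpoint
                params={"name": profile.name},
            )
            resp.raise_for_status()
            data = resp.json()
        # Only fill fields the crawler left empty, and record the source.
        profile.industry = profile.industry or data.get("industry")
        profile.employee_count = profile.employee_count or data.get("employees")
        profile.sources.append(self.name)
        return profile
```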
## 🚀 Quick Start

### Prerequisites

- Python 3.11+
- Poetry (for dependency management)
- Docker & Docker Compose (optional)

### Local Development
1. **Clone and Setup**

   ```bash
   git clone https://github.com/subslink326/job-url-analyzer-mcp.git
   cd job-url-analyzer-mcp
   poetry install
   ```

2. **Environment Configuration (Optional)**

   ```bash
   # The application has sensible defaults and can run without environment configuration.
   # To customize settings, create a .env file with your configuration.
   # See src/job_url_analyzer/config.py for available settings.
   ```

3. **Database Setup**

   ```bash
   poetry run alembic upgrade head
   ```

4. **Run Development Server**

   ```bash
   poetry run python -m job_url_analyzer.main
   # Server starts at http://localhost:8000
   ```
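Once the server is running, a quick way to verify it is the health endpoint described under Monitoring below:

```bash
curl http://localhost:8000/health
```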
### Docker Deployment

1. **Development**

   ```bash
   docker-compose up --build
   ```

2. **Production**

   ```bash
   docker-compose -f docker-compose.prod.yml up -d
   ```
## 📡 API Usage

### Analyze Job URL

```bash
curl -X POST "http://localhost:8000/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://company.com/jobs/software-engineer",
    "include_enrichment": true,
    "force_refresh": false
  }'
```
### Response Example

```json
{
  "profile_id": "123e4567-e89b-12d3-a456-426614174000",
  "source_url": "https://company.com/jobs/software-engineer",
  "company_profile": {
    "name": "TechCorp",
    "description": "Leading AI company...",
    "industry": "Technology",
    "employee_count": 150,
    "funding_stage": "Series B",
    "total_funding": 25.0,
    "headquarters": "San Francisco, CA",
    "tech_stack": ["Python", "React", "AWS"],
    "benefits": ["Health insurance", "Remote work"]
  },
  "completeness_score": 0.85,
  "confidence_score": 0.90,
  "processing_time_ms": 3450,
  "enrichment_sources": ["crunchbase"],
  "markdown_report": "# TechCorp - Company Analysis Report\n..."
}
```
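Equivalently, a minimal async client in Python (a sketch only; the request and response field names mirror the examples above, and the server is assumed to be running on localhost:8000):

```python
# Minimal async client for the /analyze endpoint shown above (sketch).
import asyncio

import httpx


async def analyze(url: str) -> dict:
    payload = {"url": url, "include_enrichment": True, "force_refresh": False}
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post("http://localhost:8000/analyze", json=payload)
        resp.raise_for_status()
        return resp.json()


if __name__ == "__main__":
    result = asyncio.run(analyze("https://company.com/jobs/software-engineer"))
    print(f"completeness: {result['completeness_score']:.2f}")
    print(result["markdown_report"])
```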
## ⚙️ Configuration

### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `DEBUG` | Enable debug mode | `false` |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `DATABASE_URL` | Database connection string | `sqlite+aiosqlite:///./data/job_analyzer.db` |
| `MAX_CONCURRENT_REQUESTS` | Max concurrent HTTP requests | `10` |
| `REQUEST_TIMEOUT` | HTTP request timeout (seconds) | `30` |
| `CRAWL_DELAY` | Delay between requests (seconds) | `1.0` |
| `RESPECT_ROBOTS_TXT` | Respect robots.txt | `true` |
| `ENABLE_CRUNCHBASE` | Enable Crunchbase enrichment | `false` |
| `CRUNCHBASE_API_KEY` | Crunchbase API key | `""` |
| `DATA_RETENTION_DAYS` | Data retention period (days) | `90` |
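For example, a `.env` tuned for a faster crawl with Crunchbase enrichment enabled might look like this (every key comes from the table above; the values are illustrative):

```dotenv
# Illustrative .env; all keys are documented in the table above.
DEBUG=false
PORT=8000
DATABASE_URL=sqlite+aiosqlite:///./data/job_analyzer.db
MAX_CONCURRENT_REQUESTS=20
CRAWL_DELAY=0.5
RESPECT_ROBOTS_TXT=true
ENABLE_CRUNCHBASE=true
CRUNCHBASE_API_KEY=your-api-key-here
DATA_RETENTION_DAYS=90
```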
## 📊 Monitoring

### Metrics Endpoints

- **Health Check**: `GET /health`
- **Prometheus Metrics**: `GET /metrics`
### Key Metrics

- `job_analyzer_requests_total`: Total API requests
- `job_analyzer_analysis_success_total`: Successful analyses
- `job_analyzer_completeness_score`: Data completeness distribution
- `job_analyzer_crawl_requests_total`: Crawl requests by status
- `job_analyzer_enrichment_success_total`: Enrichment success by provider
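A couple of example PromQL queries over these metrics (sketches; label names such as `provider` are assumptions based on the descriptions above, not confirmed label sets):

```promql
# API request rate over the last 5 minutes
rate(job_analyzer_requests_total[5m])

# Enrichment throughput per provider (the "provider" label is assumed)
sum by (provider) (rate(job_analyzer_enrichment_success_total[5m]))
```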
## 🧪 Testing

### Run Tests

```bash
# Unit tests
poetry run pytest

# With coverage
poetry run pytest --cov=job_url_analyzer --cov-report=html

# Integration tests only
poetry run pytest -m integration

# Skip slow tests
poetry run pytest -m "not slow"
```
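The `integration` and `slow` markers used above are custom marks; registering them keeps pytest from warning about unknown markers. A `pyproject.toml` snippet consistent with those commands (a sketch, not necessarily the project's actual config):

```toml
# Sketch: marker registration matching the test commands above.
[tool.pytest.ini_options]
markers = [
    "integration: tests that hit external services",
    "slow: long-running tests, skipped with -m 'not slow'",
]
```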
## 🚀 Deployment

### Kubernetes

```bash
# Apply manifests
kubectl apply -f kubernetes/

# Check deployment
kubectl get pods -l app=job-analyzer
kubectl logs -f deployment/job-analyzer
```
### Production Checklist

- [ ] Environment variables configured
- [ ] Database migrations applied
- [ ] SSL certificates configured
- [ ] Monitoring dashboards set up
- [ ] Log aggregation configured
- [ ] Backup strategy implemented
- [ ] Rate limiting configured
- [ ] Resource limits set (see the sketch below)
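For the last item, a minimal `resources` stanza for the container spec might look like this (the values are illustrative defaults, not taken from the manifests in `kubernetes/`; tune them to your workload):

```yaml
# Illustrative resource requests/limits for the job-analyzer container.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi
```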
## 🔧 Development

### Project Structure

```
job-url-analyzer/
├── src/job_url_analyzer/      # Main application code
│   ├── enricher/              # Enrichment providers
│   ├── main.py                # FastAPI application
│   ├── config.py              # Configuration
│   ├── models.py              # Pydantic models
│   ├── database.py            # Database models
│   ├── crawler.py             # Web crawler
│   ├── extractor.py           # Content extraction
│   ├── orchestrator.py        # Main orchestrator
│   └── report_generator.py    # Report generation
├── tests/                     # Test suite
├── alembic/                   # Database migrations
├── kubernetes/                # K8s manifests
├── monitoring/                # Monitoring configs
├── docker-compose.yml         # Development setup
├── docker-compose.prod.yml    # Production setup
└── Dockerfile                 # Container definition
```
### Code Quality

The project uses:

- **Black** for code formatting
- **Ruff** for linting
- **MyPy** for type checking
- **Pre-commit hooks** for quality gates

```bash
# Setup pre-commit
poetry run pre-commit install

# Run quality checks
poetry run black .
poetry run ruff check .
poetry run mypy src/
```
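A `.pre-commit-config.yaml` wiring up those three tools typically looks like the following (the hook repos are the standard upstream ones, but the `rev` pins are illustrative and the project's actual config may differ):

```yaml
# Illustrative pre-commit config for Black, Ruff, and MyPy;
# pin `rev` values to the versions the project actually uses.
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy
```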
## 📝 Recent Changes

### Dependency Updates

- **Fixed**: Replaced the non-existent `aiohttp-robotparser` dependency with `robotexclusionrulesparser` for robots.txt parsing
- **Improved**: Setup now works out of the box without requiring `.env` file configuration
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (`poetry run pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.
## 🆘 Support

- **Documentation**: This README and inline code comments
- **Issues**: GitHub Issues for bug reports and feature requests
- **Discussions**: GitHub Discussions for questions and community

Built with ❤️ using FastAPI, SQLAlchemy, and modern Python tooling.