
# Job URL Analyzer MCP Server

A comprehensive FastAPI-based microservice for analyzing job URLs and extracting detailed company information. Built with modern async Python, this service crawls job postings and company websites to build rich company profiles with data enrichment from external providers.
## ✨ Features

- 🕷️ **Intelligent Web Crawling**: Respectful crawling with robots.txt compliance and rate limiting
- 🧠 **Content Extraction**: Advanced HTML parsing using Selectolax for fast, accurate data extraction
- 🔗 **Data Enrichment**: Pluggable enrichment providers (Crunchbase, LinkedIn, custom APIs)
- 📊 **Quality Scoring**: Completeness and confidence metrics for extracted data
- 📝 **Markdown Reports**: Beautiful, comprehensive company analysis reports
- 🔍 **Observability**: OpenTelemetry tracing, Prometheus metrics, structured logging
- 🚀 **Production Ready**: Docker, Kubernetes, health checks, graceful shutdown
- 🧪 **Well Tested**: Comprehensive test suite with 80%+ coverage
## 🏗️ Architecture

```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   FastAPI App   │─────▶│  Orchestrator   │─────▶│   Web Crawler   │
└─────────────────┘      └─────────────────┘      └─────────────────┘
                                  │                        │
                                  ▼                        ▼
                         ┌─────────────────┐      ┌─────────────────┐
                         │ Content Extract │      │    Database     │
                         └─────────────────┘      │  (SQLAlchemy)   │
                                  │               └─────────────────┘
                                  ▼
                         ┌─────────────────┐      ┌─────────────────┐
                         │   Enrichment    │─────▶│    Providers    │
                         │     Manager     │      │ (Crunchbase,etc)│
                         └─────────────────┘      └─────────────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ Report Generator│
                         └─────────────────┘
```
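The Enrichment Manager treats providers as plug-ins. As a rough sketch of what a custom provider could look like (the class shape, method name, and profile fields below are assumptions for illustration, not the project's actual interface; see `src/job_url_analyzer/enricher/` for the real one):

```python
# Hypothetical sketch of a custom enrichment provider. The base shape,
# method signature, and profile fields are illustrative assumptions.
from dataclasses import dataclass, field

import httpx


@dataclass
class CompanyProfile:
    name: str
    industry: str | None = None
    employee_count: int | None = None
    sources: list[str] = field(default_factory=list)


class MyApiProvider:
    """Enriches a profile from a custom HTTP API (endpoint is hypothetical)."""

    name = "my_api"

    async def enrich(self, profile: CompanyProfile) -> CompanyProfile:
        async with httpx.AsyncClient(timeout=10) as client:
            resp = await client.get(
                "https://api.example.com/companies",  # hypothetical endpoint
                params={"name": profile.name},
            )
            resp.raise_for_status()
            data = resp.json()
        # Only fill fields the crawler left empty, and record the source.
        profile.industry = profile.industry or data.get("industry")
        profile.employee_count = profile.employee_count or data.get("employees")
        profile.sources.append(self.name)
        return profile
```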
## 🚀 Quick Start

### Prerequisites

- Python 3.11+
- Poetry (for dependency management)
- Docker & Docker Compose (optional)

### Local Development
1. **Clone and Setup**

   ```bash
   git clone https://github.com/subslink326/job-url-analyzer-mcp.git
   cd job-url-analyzer-mcp
   poetry install
   ```

2. **Environment Configuration (Optional)**

   ```bash
   # The application has sensible defaults and can run without environment configuration.
   # To customize settings, create a .env file with your configuration.
   # See src/job_url_analyzer/config.py for available settings.
   ```

3. **Database Setup**

   ```bash
   poetry run alembic upgrade head
   ```

4. **Run Development Server**

   ```bash
   poetry run python -m job_url_analyzer.main
   # Server starts at http://localhost:8000
   ```
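Once the server is running, a quick way to verify it is the health endpoint described under Monitoring below:

```bash
curl http://localhost:8000/health
```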
### Docker Deployment

1. **Development**

   ```bash
   docker-compose up --build
   ```

2. **Production**

   ```bash
   docker-compose -f docker-compose.prod.yml up -d
   ```
## 📡 API Usage

### Analyze Job URL

```bash
curl -X POST "http://localhost:8000/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://company.com/jobs/software-engineer",
    "include_enrichment": true,
    "force_refresh": false
  }'
```
### Response Example

```json
{
  "profile_id": "123e4567-e89b-12d3-a456-426614174000",
  "source_url": "https://company.com/jobs/software-engineer",
  "company_profile": {
    "name": "TechCorp",
    "description": "Leading AI company...",
    "industry": "Technology",
    "employee_count": 150,
    "funding_stage": "Series B",
    "total_funding": 25.0,
    "headquarters": "San Francisco, CA",
    "tech_stack": ["Python", "React", "AWS"],
    "benefits": ["Health insurance", "Remote work"]
  },
  "completeness_score": 0.85,
  "confidence_score": 0.90,
  "processing_time_ms": 3450,
  "enrichment_sources": ["crunchbase"],
  "markdown_report": "# TechCorp - Company Analysis Report\n..."
}
```
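Equivalently, a minimal async client in Python (a sketch only; the request and response field names mirror the examples above, and the server is assumed to be running on localhost:8000):

```python
# Minimal async client for the /analyze endpoint shown above (sketch).
import asyncio

import httpx


async def analyze(url: str) -> dict:
    payload = {"url": url, "include_enrichment": True, "force_refresh": False}
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post("http://localhost:8000/analyze", json=payload)
        resp.raise_for_status()
        return resp.json()


if __name__ == "__main__":
    result = asyncio.run(analyze("https://company.com/jobs/software-engineer"))
    print(f"completeness: {result['completeness_score']:.2f}")
    print(result["markdown_report"])
```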
## ⚙️ Configuration

### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `DEBUG` | Enable debug mode | `false` |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `DATABASE_URL` | Database connection string | `sqlite+aiosqlite:///./data/job_analyzer.db` |
| `MAX_CONCURRENT_REQUESTS` | Max concurrent HTTP requests | `10` |
| `REQUEST_TIMEOUT` | HTTP request timeout (seconds) | `30` |
| `CRAWL_DELAY` | Delay between requests (seconds) | `1.0` |
| `RESPECT_ROBOTS_TXT` | Respect robots.txt | `true` |
| `ENABLE_CRUNCHBASE` | Enable Crunchbase enrichment | `false` |
| `CRUNCHBASE_API_KEY` | Crunchbase API key | `""` |
| `DATA_RETENTION_DAYS` | Data retention period (days) | `90` |
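For example, a `.env` tuned for a faster crawl with Crunchbase enrichment enabled might look like this (every key comes from the table above; the values are illustrative):

```dotenv
# Illustrative .env; all keys are documented in the table above.
DEBUG=false
PORT=8000
DATABASE_URL=sqlite+aiosqlite:///./data/job_analyzer.db
MAX_CONCURRENT_REQUESTS=20
CRAWL_DELAY=0.5
RESPECT_ROBOTS_TXT=true
ENABLE_CRUNCHBASE=true
CRUNCHBASE_API_KEY=your-api-key-here
DATA_RETENTION_DAYS=90
```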
## 📊 Monitoring

### Metrics Endpoints

- **Health Check**: `GET /health`
- **Prometheus Metrics**: `GET /metrics`
### Key Metrics

- `job_analyzer_requests_total`: Total API requests
- `job_analyzer_analysis_success_total`: Successful analyses
- `job_analyzer_completeness_score`: Data completeness distribution
- `job_analyzer_crawl_requests_total`: Crawl requests by status
- `job_analyzer_enrichment_success_total`: Enrichment success by provider
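A couple of example PromQL queries over these metrics (sketches; label names such as `provider` are assumptions based on the descriptions above, not confirmed label sets):

```promql
# API request rate over the last 5 minutes
rate(job_analyzer_requests_total[5m])

# Enrichment throughput per provider (the "provider" label is assumed)
sum by (provider) (rate(job_analyzer_enrichment_success_total[5m]))
```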
## 🧪 Testing

### Run Tests

```bash
# Unit tests
poetry run pytest

# With coverage
poetry run pytest --cov=job_url_analyzer --cov-report=html

# Integration tests only
poetry run pytest -m integration

# Skip slow tests
poetry run pytest -m "not slow"
```
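The `integration` and `slow` markers used above are custom marks; registering them keeps pytest from warning about unknown markers. A `pyproject.toml` snippet consistent with those commands (a sketch, not necessarily the project's actual config):

```toml
# Sketch: marker registration matching the test commands above.
[tool.pytest.ini_options]
markers = [
    "integration: tests that hit external services",
    "slow: long-running tests, skipped with -m 'not slow'",
]
```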
## 🚀 Deployment

### Kubernetes

```bash
# Apply manifests
kubectl apply -f kubernetes/

# Check deployment
kubectl get pods -l app=job-analyzer
kubectl logs -f deployment/job-analyzer
```
### Production Checklist

- [ ] Environment variables configured
- [ ] Database migrations applied
- [ ] SSL certificates configured
- [ ] Monitoring dashboards set up
- [ ] Log aggregation configured
- [ ] Backup strategy implemented
- [ ] Rate limiting configured
- [ ] Resource limits set (see the sketch below)
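For the last item, a minimal `resources` stanza for the container spec might look like this (the values are illustrative defaults, not taken from the manifests in `kubernetes/`; tune them to your workload):

```yaml
# Illustrative resource requests/limits for the job-analyzer container.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi
```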
## 🔧 Development

### Project Structure

```
job-url-analyzer/
├── src/job_url_analyzer/      # Main application code
│   ├── enricher/              # Enrichment providers
│   ├── main.py                # FastAPI application
│   ├── config.py              # Configuration
│   ├── models.py              # Pydantic models
│   ├── database.py            # Database models
│   ├── crawler.py             # Web crawler
│   ├── extractor.py           # Content extraction
│   ├── orchestrator.py        # Main orchestrator
│   └── report_generator.py    # Report generation
├── tests/                     # Test suite
├── alembic/                   # Database migrations
├── kubernetes/                # K8s manifests
├── monitoring/                # Monitoring configs
├── docker-compose.yml         # Development setup
├── docker-compose.prod.yml    # Production setup
└── Dockerfile                 # Container definition
```
### Code Quality

The project uses:

- **Black** for code formatting
- **Ruff** for linting
- **MyPy** for type checking
- **Pre-commit hooks** for quality gates

```bash
# Setup pre-commit
poetry run pre-commit install

# Run quality checks
poetry run black .
poetry run ruff check .
poetry run mypy src/
```
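A `.pre-commit-config.yaml` wiring up those three tools typically looks like the following (the hook repos are the standard upstream ones, but the `rev` pins are illustrative and the project's actual config may differ):

```yaml
# Illustrative pre-commit config for Black, Ruff, and MyPy;
# pin `rev` values to the versions the project actually uses.
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy
```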
## 📝 Recent Changes

### Dependency Updates

- **Fixed**: Replaced the non-existent `aiohttp-robotparser` dependency with `robotexclusionrulesparser` for robots.txt parsing
- **Improved**: Setup now works out of the box without requiring `.env` file configuration
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (`poetry run pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.
## 🆘 Support

- **Documentation**: This README and inline code comments
- **Issues**: GitHub Issues for bug reports and feature requests
- **Discussions**: GitHub Discussions for questions and community

Built with ❤️ using FastAPI, SQLAlchemy, and modern Python tooling.