
Global MCP Server
A modular MCP server that extends GitHub Copilot's capabilities through intelligent context compression and dynamic model routing for long-lived coding sessions.
README
Global MCP Server
A modular MCP (Model Context Protocol) server that extends GitHub Copilot's capabilities by providing intelligent context compression and dynamic model routing for long-lived coding sessions.
Overview
During extended development sessions, context windows can become overwhelmed with large amounts of code, documentation, and conversation history. The Global MCP Server addresses this challenge through:
- Context Compression: Intelligently reduces KV cache size while preserving semantic meaning
- Smart Routing: Routes prompts to appropriately-sized models based on complexity analysis
- Tool Chaining: Seamlessly integrates multiple compression and routing techniques
- External Integrations: Connects with Jira, GitHub, and filesystem for comprehensive development workflows
Core Services
🔬 FreqKV Service - Frequency Domain Compression
What it does: Compresses large context windows using Discrete Cosine Transform (DCT) to remove high-frequency "noise" while preserving essential information.
How it works:
- Applies DCT to convert context embeddings from time domain to frequency domain
- Removes high-frequency components that contribute less to semantic meaning
- Preserves "sink tokens" (first N tokens) that are critical for context understanding
- Reconstructs compressed representation using inverse DCT
Benefits:
- Reduces context size by 30-70% while maintaining semantic fidelity
- Particularly effective for removing redundant or repetitive information
- Fast processing using optimized NumPy/SciPy operations
Example: A 1000-token context becomes 300 tokens with 70% of semantic information preserved.
🔗 LoCoCo Service - Convolution-based Context Fusion
What it does: Further compresses context by fusing multiple tokens into representative "super-tokens" using 1D convolution.
How it works:
- Applies sliding window convolution across the token sequence
- Uses learnable kernels to combine adjacent tokens into fused representations
- Maintains fixed output size regardless of input length
- Preserves local relationships between tokens through overlapping windows
Benefits:
- Consistent output size for predictable memory usage
- Maintains local context relationships
- Configurable compression ratios and kernel sizes
- Works synergistically with FreqKV for multi-stage compression
Example: After FreqKV reduces 1000→300 tokens, LoCoCo further compresses to 128 fixed-size tokens.
🧠 Routing Service - Intelligent Model Selection
What it does: Analyzes prompt complexity and routes requests to the most appropriate local LLM to optimize response time and resource usage.
Orchestration Method: Uses direct API calls with fallback mechanisms - no external orchestration platform required.
How it works:
- Pattern Matching: Uses regex patterns to identify complexity indicators
- Heuristic Analysis: Considers prompt length, technical keywords, and code complexity
- Classification Scoring: Combines multiple signals to classify as "simple", "moderate", or "complex"
- Model Selection: Routes to appropriate model tier (Phi-3 → Mistral → Llama-3)
- Direct API Communication: Makes HTTP calls directly to model endpoints (Ollama, custom APIs)
- Graceful Fallbacks: Automatically switches to mock responses if models are unavailable
Complexity Classifications:
-
Simple (
phi-3
): Basic formatting, renaming, simple fixes- Examples: "Fix indentation", "Add import statement", "Rename variable"
-
Moderate (
mistral
): Code implementation, refactoring, debugging- Examples: "Implement function", "Refactor class", "Debug error"
-
Complex (
llama-3
): Architecture, integration, performance optimization- Examples: "Design microservices", "Optimize database queries", "Build CI/CD pipeline"
Benefits:
- Faster responses for simple tasks (3B vs 70B parameter models)
- Better resource utilization
- Scalable to team usage patterns
- Fallback mechanisms for model unavailability
📊 Model Registry - Endpoint Management
What it does: Provides a pluggable system for managing multiple LLM endpoints and their routing configurations.
How it works:
- Model Registration: Maps model names to endpoints (Ollama, HTTP APIs, etc.)
- Complexity Mapping: Associates complexity levels with specific models
- Configuration Persistence: Stores settings in JSON for easy modification
- Runtime Updates: Allows dynamic model registration and routing changes
Supported Endpoints:
- Ollama:
ollama://model-name
for local models - HTTP APIs: Direct HTTP endpoints for custom model servers
- Mock Endpoints: For testing and development
Tool Chain Pipeline
The services work together in a coordinated pipeline:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Input │───▶│ FreqKV │───▶│ LoCoCo │───▶│ Routing │
│ Context │ │ Compression │ │ Fusion │ │ & Response │
│ │ │ │ │ │ │ │
│ 1000 tokens │ │ 300 tokens │ │ 128 tokens │ │ Optimized │
│ │ │ (DCT-based) │ │ (Conv-based)│ │ Model Route │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
- Context Ingestion: Large context (code files, conversation history)
- Frequency Compression: FreqKV removes semantic redundancy
- Spatial Compression: LoCoCo fuses tokens into fixed-size representation
- Complexity Analysis: Routing service analyzes prompt characteristics
- Model Selection: Route to appropriate model based on complexity
- Response Generation: Generate response using compressed context
Installation
pip install -r requirements.txt
Usage
python -m mcp.server
Configuration
The server uses .vscode/mcp.json
for MCP tool configurations including Jira, GitHub, and filesystem integrations.
MCP Tool Integration
The Global MCP Server provides several tools that integrate seamlessly with GitHub Copilot:
Available Tools
-
compress_kv_cache
: Compresses large context windows- Input: KV cache array, compression settings
- Output: Compressed cache with statistics
- Use case: Reduce memory usage for long conversations
-
route_prompt
: Intelligently routes prompts to appropriate models- Input: Prompt text, optional context
- Output: Model response with routing decision explanation
- Use case: Optimize response time and resource usage
-
process_full_pipeline
: Runs complete compression + routing pipeline- Input: Prompt + optional KV cache
- Output: Compressed context + routed response
- Use case: End-to-end optimization for complex development tasks
MCP Integration Benefits
- Transparent Compression: Context compression happens automatically
- Intelligent Scaling: Automatically adapts to prompt complexity
- Resource Optimization: Uses appropriate model size for each task
- Seamless Fallbacks: Graceful degradation when services are unavailable
External Service Integrations
The server coordinates with multiple external MCP services:
🎫 Jira Integration
- Purpose: Access project tickets, create issues, update status
- Tools: Query tickets, create tasks, update assignees
- Configuration: Requires Jira URL, username, and API token
🐙 GitHub Integration
- Purpose: Repository operations, PR management, issue tracking
- Tools: Read files, create branches, manage pull requests
- Configuration: Requires GitHub personal access token
📁 Filesystem Integration
- Purpose: Secure file operations within allowed directories
- Tools: Read/write files, directory operations, search
- Configuration: Whitelist of allowed paths and permissions
Performance Characteristics
Compression Metrics
- FreqKV Compression: 30-70% size reduction with minimal quality loss
- LoCoCo Fusion: Fixed output size regardless of input length
- Combined Pipeline: Up to 90% size reduction while preserving semantic meaning
Routing Performance
- Classification Speed: <50ms for prompt analysis
- Model Selection: Instant lookup from registry
- Response Time Improvement:
- Simple tasks: 3-5x faster (using Phi-3 vs Llama-3)
- Complex tasks: Maintains quality with appropriate model selection
Resource Usage
- Memory: Compressed contexts use 10-50% of original memory
- CPU: Compression adds 100-300ms overhead
- GPU: Model routing optimizes GPU utilization across different model sizes
Installation & Setup
Prerequisites
- Python 3.10 or higher
- Optional: Ollama for local LLM support
- Optional: Redis for caching (future enhancement)
Quick Start
# Clone repository
git clone https://github.com/yourusername/globalmcp.git
cd globalmcp
# Set up development environment
./setup_dev.sh
# Install dependencies
pip install -r requirements.txt
# Run demo to verify installation
python demo.py
# Start the MCP server
python -m mcp.server
Environment Variables
Configure the following environment variables for external service integration:
# Jira Integration
export JIRA_URL="https://yourcompany.atlassian.net"
export JIRA_USERNAME="your-email@company.com"
export JIRA_API_TOKEN="your-jira-token"
# GitHub Integration
export GITHUB_PERSONAL_ACCESS_TOKEN="ghp_your-token-here"
export GITHUB_OWNER="your-github-username"
export GITHUB_REPO="your-default-repo"
# Server Configuration
export MCP_SERVER_HOST="localhost"
export MCP_SERVER_PORT="8000"
Advanced Configuration
VS Code MCP Configuration
The .vscode/mcp.json
file configures all MCP integrations:
{
"mcpServers": {
"globalmcp": {
"command": "python",
"args": ["-m", "mcp.server"],
"env": {
"MCP_SERVER_HOST": "localhost",
"MCP_SERVER_PORT": "8000"
}
},
"jira": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-jira"],
"env": {
"JIRA_URL": "${JIRA_URL}",
"JIRA_USERNAME": "${JIRA_USERNAME}",
"JIRA_API_TOKEN": "${JIRA_API_TOKEN}"
}
}
}
}
Service-Specific Configuration
Each service has its own configuration file in the config/
directory:
model_registry.json
: Model endpoints and complexity mappingsjira_config.json
: Jira connection and project settingsgithub_config.json
: GitHub API and repository settingsfilesystem_config.json
: Allowed paths and security settings
Model Configuration
Customize model routing in config/model_registry.json
:
{
"models": {
"phi3": "ollama://phi3",
"mistral": "ollama://mistral",
"llama3": "ollama://llama3"
},
"complexity_mapping": {
"simple": "ollama://phi3",
"moderate": "ollama://mistral",
"complex": "ollama://llama3"
}
}
Usage Examples
Basic Context Compression
# Compress a large KV cache
response = await mcp_client.call_tool("compress_kv_cache", {
"kv_cache": large_context_array,
"sink_tokens": 10,
"compression_ratio": 0.6
})
print(f"Compressed from {response['original_size']} to {response['compressed_size']} tokens")
Smart Prompt Routing
# Route prompt to appropriate model
response = await mcp_client.call_tool("route_prompt", {
"prompt": "Implement a Redis caching layer for this API",
"context": "Working on a Node.js microservice"
})
print(f"Routed to {response['model_used']} based on {response['complexity']} complexity")
Full Pipeline Processing
# Process through complete pipeline
response = await mcp_client.call_tool("process_full_pipeline", {
"prompt": "Optimize this database query for better performance",
"kv_cache": conversation_context,
"context": "PostgreSQL database with 1M+ records"
})
# Get both compression and routing results
compression_stats = response['compression']
routing_decision = response['routing']
Development & Testing
Running Tests
# Install test dependencies
pip install -r requirements-dev.txt
# Run all tests
pytest
# Run specific service tests
pytest mcp/tests/test_freqkv.py -v
pytest mcp/tests/test_lococo.py -v
Demo Script
The included demo script shows all features:
python demo.py
This demonstrates:
- KV cache compression pipeline
- Prompt complexity classification
- Model routing decisions
- End-to-end processing
Development Mode
Start the server in development mode with auto-reload:
uvicorn mcp.server:app --reload --host 0.0.0.0 --port 8000
Architecture Decisions
Why Frequency Domain Compression?
- Semantic Preservation: DCT naturally separates important low-frequency information from noise
- Computational Efficiency: Fast FFT algorithms make compression lightweight
- Tunable Quality: Compression ratio directly controls quality vs size tradeoffs
Why Convolution for Token Fusion?
- Local Context Preservation: Sliding windows maintain relationships between adjacent tokens
- Fixed Output Size: Predictable memory usage regardless of input size
- Hardware Optimized: Convolution operations are highly optimized on modern hardware
Why Pattern-Based Routing?
- Fast Classification: Regex patterns provide instant complexity assessment
- Interpretable Decisions: Clear reasoning for routing choices
- Easy Customization: Patterns can be updated without retraining models
- Fallback Ready: Works even when classification models are unavailable
Troubleshooting
Common Issues
- Import Errors: Ensure all dependencies are installed with
pip install -r requirements.txt
- Ollama Connection: Verify Ollama is running on localhost:11434
- Configuration: Check that
.vscode/mcp.json
has correct paths and environment variables - Permissions: Ensure filesystem paths in config are accessible
Debug Mode
Enable detailed logging:
python -m mcp.server --log-level DEBUG
Health Checks
Verify server status:
curl http://localhost:8000/health
Contributing
See CONTRIBUTING.md for development guidelines and coding standards.
License
This project follows standard open source licensing practices.
Orchestration Architecture
The Global MCP Server uses a lightweight, direct-communication orchestration model rather than complex service mesh or message queue systems:
Orchestration Components
- FastAPI Application Server: Central coordination point for all MCP requests
- Direct API Calls: Services communicate via HTTP/HTTPS without intermediary layers
- Built-in Service Discovery: Model registry provides endpoint lookup without external service discovery
- Async/Await Concurrency: Python asyncio handles concurrent requests efficiently
Model Orchestration Methods
Ollama Integration
# Direct HTTP API calls to Ollama server
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11434/api/generate",
json={
"model": "phi3",
"prompt": prompt,
"stream": False
}
)
Custom HTTP Endpoints
# Generic HTTP API support for any model server
response = await client.post(
model_endpoint,
json={
"prompt": prompt,
"max_tokens": 512
}
)
Fallback Mechanisms
- Connection Failures: Automatic fallback to mock responses
- Model Unavailable: Route to alternative model in same complexity tier
- Timeout Handling: 30-second timeouts with graceful degradation
Why This Orchestration Approach?
- Simplicity: No external dependencies like Kubernetes, Docker Swarm, or service meshes
- Performance: Direct API calls minimize latency vs message queues
- Reliability: Fewer moving parts means fewer failure points
- Development Speed: Easy to debug and extend without orchestration complexity
- Resource Efficiency: Minimal overhead compared to heavy orchestration platforms
Comparison with Alternative Orchestration
Method | Complexity | Latency | Dependencies | Use Case |
---|---|---|---|---|
Direct API (Current) | Low | <100ms | None | Development tools, local deployment |
Kubernetes | High | 200-500ms | K8s cluster | Production at scale |
Docker Swarm | Medium | 150-300ms | Docker | Medium-scale deployment |
Message Queues | Medium | 100-200ms | Redis/RabbitMQ | Asynchronous processing |
Future Orchestration Enhancements
For production scaling, the architecture supports easy migration to:
- Load Balancers: HAProxy or Nginx for model endpoint distribution
- Container Orchestration: Docker Compose or Kubernetes manifests
- Service Mesh: Istio or Linkerd for advanced traffic management
- Message Queues: Redis or RabbitMQ for asynchronous request processing
Routing Strategy Analysis
Current Implementation: Regex Pattern Matching
The current router uses regex pattern matching combined with heuristic analysis for prompt classification. Here's a detailed comparison of approaches:
Regex Pattern Matching (Current)
Advantages:
- Ultra-low latency: <1ms classification time
- Zero dependencies: No additional model loading or GPU memory
- Deterministic: Same input always produces same output
- Interpretable: Clear reasoning for routing decisions
- No network calls: Entirely local computation
- Easy to debug: Pattern matches are visible and traceable
- Customizable: Patterns can be updated instantly without retraining
Disadvantages:
- Limited context understanding: Cannot understand semantic nuance
- Brittle to variations: "implement function" vs "build a function" might route differently
- Manual maintenance: Patterns need manual updates for new use cases
- False positives: May misclassify edge cases
Current Implementation Performance:
# Classification time: <1ms
complexity_scores = {
"simple": 2, # Matched "fix" and "format"
"moderate": 0, # No matches
"complex": 0 # No matches
}
# Result: "simple" complexity → routes to Phi-3
Super Lightweight LLM Approach
Advantages:
- Semantic understanding: Can understand intent beyond keywords
- Context awareness: Considers full prompt context and nuance
- Adaptive: Improves with better training data
- Robust to variations: Handles paraphrasing and edge cases better
- Future-proof: Can evolve with new prompt patterns
Disadvantages:
- Higher latency: 50-200ms for small models like TinyLlama/Phi-3-mini
- Resource overhead: Requires GPU/CPU for inference
- Model dependency: Need to load and maintain classification model
- Less predictable: Same input might vary slightly in output
- Complex debugging: Black box decision making
- Cold start penalty: Initial model loading time
Hybrid Approach Recommendation
Best of both worlds - Use regex as primary with LLM fallback:
async def classify_complexity_hybrid(self, prompt: str, context: str = "") -> str:
# Fast regex classification first
regex_result = await self.classify_with_patterns(prompt, context)
confidence = self.calculate_pattern_confidence(prompt, context)
# If confidence is high, use regex result
if confidence > 0.8:
return regex_result
# For ambiguous cases, use lightweight LLM
return await self.classify_with_llm(prompt, context)
Performance Comparison
Method | Latency | Memory | Accuracy | Maintenance |
---|---|---|---|---|
Regex Only | <1ms | 0MB | 85-90% | Manual patterns |
LLM Only | 50-200ms | 100-500MB | 92-95% | Training data |
Hybrid | 1-200ms | 100-500MB | 90-95% | Best balance |
Recommendation: Stick with Regex (Current)
For this use case, regex pattern matching is the better choice because:
- Speed is Critical: Router decisions happen frequently and need to be fast
- Resource Efficiency: No additional GPU memory or model loading
- Reliability: Deterministic behavior is important for development tools
- Sufficient Accuracy: 85-90% accuracy is acceptable for development task routing
- Easy Maintenance: Patterns can be updated based on usage analytics
Future Enhancement Strategy
Phase 1 (Current): Regex + heuristics ✅ Phase 2: Add confidence scoring and analytics Phase 3: Hybrid approach for ambiguous cases Phase 4: Full LLM classification for production at scale
Pattern Optimization Recommendations
To improve the current regex approach:
# Enhanced patterns with better coverage
self.complexity_patterns = {
"simple": [
r"\b(fix|format|indent|rename|import|add|remove|delete)\b",
r"\b(typo|syntax|missing|extra)\s+(error|semicolon|bracket|quote)\b",
r"\bgenerate\s+(getter|setter|constructor|comment)\b",
r"\b(what|where|when|how|why)\s+(is|does|should)\b"
],
"moderate": [
r"\b(refactor|optimize|implement|create|build|write)\b",
r"\b(function|method|class|component|module)\b",
r"\b(test|debug|fix)\s+(bug|issue|error|problem)\b",
r"\b(explain|describe|analyze|review)\s+.*(code|logic|algorithm)\b"
],
"complex": [
r"\b(architect|design|migrate|transform|scale)\b",
r"\b(integrate|connect|sync)\s+.*(api|database|service|system)\b",
r"\b(performance|security|scalability)\s+(optimization|concern|issue)\b",
r"\b(microservice|distributed|architecture|infrastructure)\b"
]
}
Analytics-Driven Improvement
Add classification analytics to improve patterns over time:
# Track classification accuracy
classification_metrics = {
"total_classifications": 1250,
"user_corrections": 127, # When users manually override
"accuracy": 89.8, # Calculated accuracy
"pattern_hits": {
"simple": {"fix": 45, "format": 23, "rename": 18},
"moderate": {"implement": 67, "refactor": 34, "debug": 28},
"complex": {"architect": 12, "integrate": 19, "performance": 15}
}
}
Deployment Strategy Analysis
Docker Containerization vs Local Deployment
The Global MCP Server can be deployed either locally or in Docker containers. Here's a detailed analysis of both approaches:
Local Deployment (Current)
Advantages:
- Fastest Development: Direct Python execution with instant reloads
- Easy Debugging: Full access to debugger, logs, and development tools
- No Container Overhead: Direct access to host resources
- Simple Setup: Just
pip install
and run - VS Code Integration: Seamless integration with VS Code MCP configuration
- File System Access: Direct access to project files without volume mounts
Disadvantages:
- Environment Conflicts: Python version and dependency conflicts
- Manual Dependency Management: Need to manage Python, Ollama, etc. separately
- OS-Specific Issues: Different behavior across Windows/Mac/Linux
- No Isolation: Potential conflicts with other Python projects
Docker Container Deployment
Advantages:
- Environment Isolation: Consistent runtime across all platforms
- Dependency Management: All dependencies packaged together
- Easy Distribution: Single container image works everywhere
- Scalability: Easy to scale multiple instances
- Production Ready: Better for production deployments
- Version Control: Tagged container images for releases
- Security: Process isolation and sandboxing
Disadvantages:
- Development Overhead: Build times and container complexity
- Resource Usage: Additional memory and CPU overhead
- Network Complexity: Need to expose ports and handle networking
- Volume Management: File access requires volume mounts
- Debugging Complexity: More complex to debug containerized apps
Hybrid Recommendation: Both Approaches
For Development: Keep local deployment as primary For Production/Distribution: Docker support ✅ IMPLEMENTED
Docker Implementation Strategy ✅
The project now includes full Docker containerization with the following files:
Dockerfile
: Multi-stage build for development and productiondocker-compose.yml
: Development environment with hot reloaddocker-compose.prod.yml
: Production environment with security hardeningdocker.sh
: Helper script for common Docker operationsDOCKER.md
: Comprehensive Docker setup and usage guide
Docker Quick Start
# Development
./docker.sh dev
# Production
./docker.sh prod
# View all commands
./docker.sh help
See DOCKER.md for complete setup instructions, troubleshooting, and best practices.
Container Performance Comparison
Deployment | Startup Time | Memory Usage | Development Speed | Production Ready |
---|---|---|---|---|
Local | <1s | 50-100MB | ⭐⭐⭐⭐⭐ | ⭐⭐ |
Docker | 2-5s | 100-200MB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
VS Code MCP Integration with Docker
Update .vscode/mcp.json
to support both local and containerized deployment:
{
"mcpServers": {
"globalmcp-local": {
"command": "python",
"args": ["-m", "mcp.server"],
"env": {
"MCP_SERVER_HOST": "localhost",
"MCP_SERVER_PORT": "8000"
}
},
"globalmcp-docker": {
"command": "docker",
"args": ["run", "--rm", "-p", "8000:8000", "globalmcp:latest"],
"env": {
"MCP_SERVER_HOST": "localhost",
"MCP_SERVER_PORT": "8000"
}
}
}
}
Recommendation: Hybrid Approach
For this project, I recommend keeping local deployment as primary with Docker as an option:
- Development Phase: Use local deployment for faster iteration
- Testing Phase: Use Docker to test deployment and distribution
- Production Phase: Use Docker for consistent deployments
- Distribution Phase: Provide Docker images for easy setup
When to Choose Each Approach
Choose Local Deployment When:
- Developing and debugging the MCP server
- Working with VS Code Copilot integration
- Need fastest possible startup and reload times
- Working on a single developer machine
Choose Docker Deployment When:
- Deploying to production or staging environments
- Distributing to other developers or users
- Need consistent environment across platforms
- Running on servers or cloud platforms
- Want process isolation and security
Implementation Priority
Phase 1 (Current): Local deployment ✅
Phase 2: Add Docker support for production deployment
Phase 3: Add Docker Compose for full development stack
Phase 4: Add Kubernetes manifests for enterprise deployment
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.