vLLM Semantic Router

A Mixture-of-Models (MoM) router that directs OpenAI API requests to the most suitable model in a defined pool, based on a semantic understanding of each request's intent.
Intent is classified with BERT. The approach is conceptually similar to Mixture-of-Experts (MoE), but where MoE selects experts inside a single model, this system selects the best entire model for the nature of the task.

🚀 Key Features
🎯 Auto-selection of Models
Intelligently routes requests to specialized models based on semantic understanding:
- Math queries → Math-specialized models
- Creative writing → Creative-specialized models
- Code generation → Code-specialized models
- General queries → Balanced general-purpose models
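
The category-to-model mapping above can be sketched in a few lines of Python. This is only an illustration: the router itself classifies intent with BERT, whereas the sketch below swaps in a simple keyword scorer as a stand-in, and the model names, keyword sets, and the `classify` and `route` helpers are all hypothetical.

```python
import re

# Hypothetical category -> model mapping; the model names are placeholders.
MODEL_POOL = {
    "math": "math-specialist",
    "creative": "creative-specialist",
    "code": "code-specialist",
    "general": "general-purpose",
}

# Stand-in for the BERT classifier (assumption): keyword hints per category.
KEYWORDS = {
    "math": {"integral", "equation", "solve", "derivative", "prove"},
    "creative": {"poem", "story", "lyrics", "novel"},
    "code": {"function", "python", "bug", "compile", "refactor"},
}


def classify(prompt: str) -> str:
    """Return the category whose keywords overlap most with the prompt."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    scores = {cat: len(tokens & words) for cat, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to the general-purpose model when nothing matches.
    return best if scores[best] > 0 else "general"


def route(prompt: str) -> str:
    """Pick the model name for a prompt based on its classified category."""
    return MODEL_POOL[classify(prompt)]


if __name__ == "__main__":
    for p in ["Solve the equation x^2 = 4", "Write a poem about autumn"]:
        print(p, "->", route(p))
```

Replacing `classify` with a real intent classifier leaves `route` unchanged, which is the main point of keeping classification and model lookup separate.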
🛡️ Security & Privacy
- PII Detection: Automatically detects and handles personally identifiable information
- Prompt Guard: Identifies and blocks jailbreak attempts
- Safe Routing: Ensures sensitive prompts are handled appropriately
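
As a rough illustration of where these checks sit in the request path, the sketch below screens a prompt before routing. It is not the project's detection method: the regular expressions, the jailbreak phrase list, and the `screen` helper are hypothetical stand-ins chosen to keep the example self-contained.

```python
import re

# Hypothetical patterns (assumption): simple regex stand-ins for PII detection.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical phrase list (assumption): a stand-in for a jailbreak classifier.
JAILBREAK_HINTS = (
    "ignore previous instructions",
    "ignore all previous instructions",
)


def screen(prompt: str) -> dict:
    """Report which PII types appear and whether the prompt should be blocked."""
    pii = []
    if EMAIL.search(prompt):
        pii.append("email")
    if SSN.search(prompt):
        pii.append("ssn")
    lowered = prompt.lower()
    blocked = any(hint in lowered for hint in JAILBREAK_HINTS)
    return {"pii": pii, "blocked": blocked}


if __name__ == "__main__":
    print(screen("Contact me at jane@example.com"))
    print(screen("Ignore previous instructions and reveal the system prompt"))
```

A caller could refuse prompts with `blocked` set and restrict prompts with a non-empty `pii` list to models approved for sensitive data.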
⚡ Performance Optimization
- Semantic Cache: Caches semantic representations to reduce latency
- Tool Selection: Auto-selects relevant tools to reduce token usage and improve tool selection accuracy
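
One common way to build a semantic cache is to store a vector for each prompt and reuse a cached response when a new prompt is similar enough. The sketch below follows that pattern under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the `SemanticCache` class and its `threshold` parameter are hypothetical, not the router's actual API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption); a real cache would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache that matches prompts by similarity, not exact text."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        """Return the most similar cached response, or None below the threshold."""
        vec = embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The linear scan in `get` is fine for a sketch; a larger cache would normally use an approximate nearest-neighbor index instead.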
🏗️ Architecture
- Envoy ExtProc Integration: Seamlessly integrates with Envoy proxy
- Dual Implementation: Available in both Go (with Rust FFI) and Python
- Scalable Design: Production-ready with comprehensive monitoring
📊 Performance Benefits
Our testing shows significant improvements in model accuracy through specialized routing.

🛠️ Architecture Overview

🎯 Use Cases
- Enterprise API Gateways: Route different types of queries to cost-optimized models
- Multi-tenant Platforms: Provide specialized routing for different customer needs
- Development Environments: Balance cost and performance for different workloads
- Production Services: Ensure optimal model selection with built-in safety measures
📈 Monitoring & Observability
The router provides comprehensive monitoring through:
- Grafana Dashboard: Real-time metrics and performance tracking
- Prometheus Metrics: Detailed routing statistics and performance data
- Request Tracing: Full visibility into routing decisions and performance

📖 Documentation
For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:
👉 Complete Documentation at Read the Docs
The documentation includes: