vLLM Semantic Router

A Mixture-of-Models (MoM) router that intelligently directs OpenAI API requests to the most suitable model in a defined pool, based on semantic understanding of each request's intent.

Intent is determined with BERT classification. The idea is conceptually similar to Mixture-of-Experts (MoE), but where MoE selects experts inside a single model, this system selects the best whole model for the nature of the task.

🚀 Key Features

🎯 Auto-selection of Models

Intelligently routes requests to specialized models based on semantic understanding:

Math queries → Math-specialized models
Creative writing → Creative-specialized models
Code generation → Code-specialized models
General queries → Balanced general-purpose models
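
The mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not the router's implementation: the model names in `MODEL_POOL` are hypothetical, and a simple keyword count stands in for the BERT classifier so the example runs without any model weights.

```python
# Hypothetical model names; a real pool would come from the router's configuration.
MODEL_POOL = {
    "math": "math-specialist",
    "creative": "creative-specialist",
    "code": "code-specialist",
    "general": "general-purpose",
}

# Keyword lists stand in for the BERT classifier, purely for illustration.
KEYWORDS = {
    "math": ("integral", "derivative", "equation", "solve", "prove"),
    "creative": ("poem", "story", "lyrics", "novel"),
    "code": ("function", "python", "compile", "bug", "refactor"),
}


def classify(prompt: str) -> str:
    """Return the category whose keywords appear most often in the prompt."""
    words = [w.strip(".,?!") for w in prompt.lower().split()]
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the general-purpose category when nothing matches.
    return best if scores[best] > 0 else "general"


def route(prompt: str) -> str:
    """Pick a model name from the pool for the given prompt."""
    return MODEL_POOL[classify(prompt)]
```

Calling `route("Write a poem about autumn")` would select the creative model in this sketch, while a prompt with no matching keywords falls through to the general-purpose one.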

🛡️ Security & Privacy

PII Detection: Automatically detects and handles personally identifiable information
Prompt Guard: Identifies and blocks jailbreak attempts
Safe Routing: Ensures sensitive prompts are handled appropriately
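
To make the screening flow concrete, here is a minimal sketch of how the three checks might fit together. Everything in it is an assumption for illustration: the regex patterns, the phrase list, and the `allow` / `restrict` / `block` actions are hypothetical stand-ins, and simple pattern matching is used here in place of whatever detection models the router actually uses.

```python
import re

# Illustrative patterns only; real PII detection covers far more cases.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical phrase list standing in for a jailbreak classifier.
JAILBREAK_PHRASES = (
    "ignore previous instructions",
    "ignore all previous instructions",
)


def detect_pii(prompt: str) -> list:
    """Return the sorted names of the PII patterns found in the prompt."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt))


def is_jailbreak(prompt: str) -> bool:
    """Flag prompts containing a known jailbreak phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in JAILBREAK_PHRASES)


def screen(prompt: str) -> dict:
    """Decide how to handle a prompt before it is routed to a model."""
    if is_jailbreak(prompt):
        return {"action": "block", "pii": []}
    pii = detect_pii(prompt)
    # "restrict" is a hypothetical policy: only route to models allowed to see PII.
    return {"action": "restrict" if pii else "allow", "pii": pii}
```

In this sketch the jailbreak check runs first, so a blocked prompt is never inspected further or forwarded.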

⚡ Performance Optimization

Semantic Cache: Caches semantic representations to reduce latency
Tool Selection: Auto-selects relevant tools to reduce token usage and improve tool selection accuracy
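
One common way to build a semantic cache is to store an embedding of each prompt next to its response and reuse the response when a new prompt is similar enough. The sketch below follows that reading, which is an assumption about the design rather than a description of the router's code. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `0.8` threshold and `SemanticCache` class are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real cache would use a sentence embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new prompt is close enough to a cached one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        # Linear scan keeps the sketch short; a vector index would scale better.
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A cache hit skips the model call entirely, which is where the latency saving comes from; the threshold trades hit rate against the risk of returning an answer to a different question.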

🏗️ Architecture

Envoy ExtProc Integration: Seamlessly integrates with Envoy proxy
Dual Implementation: Available in both Go (with Rust FFI) and Python
Scalable Design: Production-ready with comprehensive monitoring

📊 Performance Benefits

Our testing shows significant improvements in model accuracy through specialized routing.

🛠️ Architecture Overview

🎯 Use Cases

Enterprise API Gateways: Route different types of queries to cost-optimized models
Multi-tenant Platforms: Provide specialized routing for different customer needs
Development Environments: Balance cost and performance for different workloads
Production Services: Ensure optimal model selection with built-in safety measures

📈 Monitoring & Observability

The router provides comprehensive monitoring through:

Grafana Dashboard: Real-time metrics and performance tracking
Prometheus Metrics: Detailed routing statistics and performance data
Request Tracing: Full visibility into routing decisions and performance

📖 Documentation

For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:

👉 Complete Documentation at Read the Docs

The documentation includes:

Installation Guide - Complete setup instructions
Quick Start - Get running in 5 minutes
System Architecture - Technical deep dive
Model Training - How classification models work
API Reference - Complete API documentation