Technology Deep Dive — Section 5

Intelligent LLM Routing

The best model for each task, automatically. Four routing strategies. Real-time health monitoring. Automatic failover in under 100ms.

4 routing strategies · 5 providers monitored · <100ms failover

The Problem With Hardcoded Models

Outages bring you down

If your single provider has an outage, your whole product stops. No fallback, no warning, no recovery. Just downtime.

Price hikes are non-negotiable

Provider raises prices? You pay it. No alternatives, no negotiation. Your cost structure is outside your control.

Better models require rewrites

A competitor launches a better model. Because your provider is hardcoded, adopting it means months of migration work.

EnGenAI abstracts all providers behind a single routing layer. Swap, fallback, and route — without changing a single line of your agent code.

Route a Request

Every request passes through four routing stages in order:

1. Health Check
2. Rate Limit
3. Strategy (Priority)
4. Access Gate

Four Routing Strategies

Choose how traffic is distributed across your providers. Change strategy per agent, per task type, or globally.

Priority

Routes to the highest-priority provider in your list. Predictable and consistent — always your first choice unless it fails.

Best for

Regulated environments, provider lock-in preferences, audit trails requiring consistent model use.

Price

Routes to the cheapest provider that meets the task's quality requirements. Cost savings without sacrificing output quality.

Best for

High-volume workloads, documentation generation, bulk processing, cost-sensitive environments.

Quality

Routes to the provider with the highest quality score for this specific task type. Quality scores are updated from real usage data.

Best for

Architecture decisions, complex reasoning, code review, security analysis, critical business logic.

Latency

Routes to the provider with the lowest current response time. Continuously tracked. Adapts as provider performance changes.

Best for

Real-time interactions, streaming responses, time-sensitive tasks, user-facing completions.

Custom

Define your own routing logic with weighted combinations. Example: 60% priority, 30% price, 10% latency. Route specific task types to specific providers. Override routing for individual agents.

priority_weight: 0.6
price_weight: 0.3
latency_weight: 0.1
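As a rough sketch of how such a weighted strategy could work (the function and field names here are illustrative assumptions, not EnGenAI's actual API): each healthy candidate gets a score from normalised priority, price, and latency signals, combined using the configured weights.

```python
# Illustrative weighted-routing sketch. Each provider dict carries
# priority_score, price_score and latency_score, each already
# normalised to [0, 1] where higher is better (an assumption).

WEIGHTS = {"priority": 0.6, "price": 0.3, "latency": 0.1}

def route(providers, weights=WEIGHTS):
    """Pick the healthy provider with the best weighted score."""
    healthy = [p for p in providers if p["healthy"]]
    if not healthy:
        raise RuntimeError("No providers available")

    def score(p):
        return (weights["priority"] * p["priority_score"]
                + weights["price"] * p["price_score"]
                + weights["latency"] * p["latency_score"])

    # max() keeps the routing decision a pure function of current scores.
    return max(healthy, key=score)
```

Overriding routing per agent would then amount to passing a different `weights` dict for that agent's requests.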

Provider Health Monitoring

Real-time latency, uptime, and circuit breaker state for every connected provider. When a provider fails, traffic shifts to the next healthy option automatically.

Anthropic (Claude Opus 4.6): healthy, ACTIVE ROUTE. Latency 3.2s, uptime 99.8%.

OpenAI (GPT-4o): healthy. Latency 2.4s, uptime 99.6%.

Google (Gemini 2.5 Pro): degraded. Latency 6.8s, uptime 97.2%.

Cohere (Command R+): healthy. Latency 1.8s, uptime 99.1%.

Mistral (Mistral Large): down. Latency N/A, last seen 4m ago.

Active route: Anthropic (Claude Opus 4.6)

Fallback Chains

Every request has a fallback chain. Primary → Secondary → Tertiary. If all fail, the circuit breaker opens and the user gets a clean error — not a hang.

INCOMING REQUEST: an agent task requires an LLM response.

PRIMARY: Anthropic Claude Opus 4.6 (healthy; p50 3.5s, uptime 99.8%, circuit breaker CLOSED)

If unhealthy or rate-limited, fall through to:

SECONDARY: Anthropic Claude Sonnet 4.6 (healthy; p50 2.1s, uptime 99.7%, circuit breaker CLOSED)

If unhealthy or rate-limited, fall through to:

TERTIARY: OpenAI GPT-4o (healthy; p50 2.8s, uptime 99.6%, circuit breaker CLOSED)

If all providers fail:

ERROR: No providers available. A clean error is returned to the user, with no silent hang. Circuit breakers remain OPEN until recovery.
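The chain logic above can be sketched in a few lines. This is a minimal illustration, not EnGenAI's implementation; the provider methods (`is_healthy`, `is_rate_limited`, `complete`) and the exception name are assumed for the example.

```python
# Fallback-chain sketch: try each provider in configured order, skipping
# unhealthy or rate-limited ones, and fail fast with a clean error
# instead of hanging when the whole chain is exhausted.

class NoProvidersAvailable(Exception):
    """Raised immediately when every provider in the chain fails."""

def call_with_fallback(chain, request):
    for provider in chain:  # PRIMARY -> SECONDARY -> TERTIARY
        if not provider.is_healthy() or provider.is_rate_limited():
            continue  # fall through to the next provider
        try:
            return provider.complete(request)
        except Exception:
            continue  # provider errored mid-call: try the next one
    raise NoProvidersAvailable("No providers available")
```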

Circuit breaker states: CLOSED (normal operation; requests flow through and failures are counted but not yet triggering), OPEN (provider taken offline after N consecutive failures; cooldown begins, 60s default), HALF-OPEN (a single test request is allowed; success returns to CLOSED, failure returns to OPEN).

Performance Comparison

Not all models are equal on latency. With latency-based routing, EnGenAI automatically selects the fastest healthy provider for each request.

p50 latency in seconds — illustrative benchmark

Claude Haiku 4.5: 0.8s
GPT-4o mini: 1.2s
Claude Sonnet 4.6: 2.1s
GPT-4o: 2.8s
Claude Opus 4.6: 3.5s
Gemini 2.5 Pro: 4.1s

EnGenAI routes to the fastest healthy provider automatically. When Claude Opus is rate-limited, traffic shifts to Sonnet without interruption. Latency routing monitors real-time p50 values — not static estimates.
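One way to monitor real-time p50 values, sketched here under assumed names (this is not EnGenAI's implementation): keep a sliding window of recent response times per provider and route to the lowest current median.

```python
# Latency-routing sketch: a bounded deque per provider holds recent
# response times; p50 is the median of that window, so the routing
# decision adapts as provider performance changes.

from collections import defaultdict, deque
from statistics import median

class LatencyRouter:
    def __init__(self, window=100):
        # window = number of recent samples to keep (assumed default)
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, provider, seconds):
        """Record one observed response time for a provider."""
        self.samples[provider].append(seconds)

    def p50(self, provider):
        return median(self.samples[provider])

    def pick(self, healthy_providers):
        """Route to the healthy provider with the lowest current p50."""
        return min(healthy_providers, key=self.p50)
```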

Circuit Breaker Pattern

Prevents cascade failures. When a provider fails repeatedly, the circuit opens automatically. No thundering herd. No resource exhaustion. Clean recovery.

CLOSED (normal operation)

Requests flow through normally. Failures are counted but don't yet block the provider. Default state after recovery.

OPEN (provider halted)

After N consecutive failures, the circuit opens. All requests skip this provider immediately. Cooldown timer starts (default: 60s).

HALF-OPEN (testing recovery)

After cooldown, a single test request is allowed through. Success: CLOSED. Failure: OPEN with extended cooldown.

Circuit breakers operate per-provider, per-tenant. One customer's provider issues do not affect other tenants. State transitions are logged and visible in the observability dashboard.
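The three-state machine can be sketched compactly. This is a minimal illustration under stated assumptions (parameter names, the injectable clock, and the single-probe handling are all ours, not EnGenAI's):

```python
# Circuit-breaker sketch: CLOSED counts failures, OPEN skips the
# provider during a cooldown, HALF-OPEN lets exactly one probe request
# decide whether to close the circuit again.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF-OPEN"
            return True  # the single test request
        return False  # OPEN within cooldown, or a probe already in flight

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Per-provider, per-tenant isolation then follows from keeping one `CircuitBreaker` instance per (provider, tenant) pair.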

Rate Limiting

Every provider enforces rate limits. EnGenAI tracks them proactively with sliding-window counters — requests per minute (RPM) and tokens per minute (TPM) — so your agents never hit a 429 error.

RPM Window (60s)

Sliding window tracks requests per minute per provider. When usage exceeds 80%, the router shifts new requests to an alternate provider before hitting the limit.

TPM Window (60s)

Token throughput tracked separately. Large prompts can consume the token budget even when request count is low. Both counters must be green to route.

Capacity-Aware Penalty

When a provider exceeds 90% of its rate limit window, it receives a 10x routing penalty — effectively deprioritised until capacity frees up. This prevents clustering requests on a nearly-full provider.

Model Access Groups

Not every plan tier gets every model. Model access groups gate which models are available based on the organisation's subscription tier — enforced at the routing layer, not the UI.

Starter

Efficient models — Claude Haiku 4.5, GPT-4o mini, Gemini 2.0 Flash

Pro

All Starter models + Claude Sonnet 4.6, GPT-4o, Gemini 2.5 Pro

Enterprise

All models — Claude Opus 4.6, GPT-4.5, o3 + BYO API keys

Access groups are enforced at the routing decision point. If an agent requests a model its organisation cannot access, the router substitutes the best available model from the allowed group — logged and visible in the observability dashboard.
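The gate-and-substitute behaviour can be sketched as a small lookup. Tier contents come from the list above; the model identifiers, the best-first ordering, and the function name are assumptions for this example.

```python
# Model access-group sketch: each tier maps to an allowed model list
# (assumed ordered best-first). A request for a model outside the tier
# is substituted with the best allowed model instead of failing.

ACCESS_GROUPS = {
    "starter": ["claude-haiku-4.5", "gpt-4o-mini", "gemini-2.0-flash"],
}
ACCESS_GROUPS["pro"] = (
    ["claude-sonnet-4.6", "gpt-4o", "gemini-2.5-pro"] + ACCESS_GROUPS["starter"]
)
ACCESS_GROUPS["enterprise"] = (
    ["claude-opus-4.6", "gpt-4.5", "o3"] + ACCESS_GROUPS["pro"]
)

def resolve_model(requested, tier):
    """Return the requested model if allowed, else the best substitute."""
    allowed = ACCESS_GROUPS[tier]
    if requested in allowed:
        return requested
    # In the real system this substitution is logged to observability.
    return allowed[0]
```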

<100ms failover time: from failure detection to traffic shift.

5 providers supported: Anthropic, OpenAI, Google, Cohere, Mistral.

4 routing strategies: Priority, Price, Quality, Latency + Custom.

Next: The Skill Engine

Routing gets requests to the right model. Skills give agents the capabilities to act. Discover how every skill is vetted before it can run.