---
created: 2026-05-09
tags:
rastr
  - research
  - AI
  - infrastructure
  - comparison
updated: 2026-05-15T17:58
updated: 2026-05-15T17:14
---
# AI Gateway / LLM Proxy Comparison (2026)

## What Is an AI Gateway?

An AI gateway sits between your application and LLM providers (Anthropic, OpenAI, Google, etc.). It handles routing, failover, cost tracking, rate limiting, caching, and observability — so your app code just makes one API call and the gateway handles the rest.

## Why You Need One

- **Failover**: If Anthropic is down, auto-route to OpenAI
- **Cost control**: Track spend per project/team, set budgets
- **Rate limit management**: Distribute across multiple API keys
- **Caching**: Don't pay twice for the same prompt
- **Observability**: See every request, latency, token usage
- **Model flexibility**: Swap models without code changes

---

## The Categories

### API Gateways (Route requests to models)
- OpenRouter, LiteLLM, Bifrost, Portkey, RelayPlane

### Intelligent Routers (Pick the best model per request)
- NotDiamond, Martian, Unify AI, RouterLLM, TensorZero

### Enterprise Orchestration (Governance, teams, compliance)
- Nexos AI, TrueFoundry, Kong AI Gateway

### Managed/CDN (Zero infrastructure)
- OpenRouter, Cloudflare AI Gateway, AWS Bedrock

### Observability-First (Logging, tracing, evals)
- Helicone, Langfuse, Braintrust

### Memory/Context (Different category)
- Backboard.io (portable memory across LLMs)

### Frameworks (Not gateways)
- LangChain, LlamaIndex, CrewAI

---

## Detailed Comparison

### Bifrost (by Maxim AI)
- **Language**: Go
- **Latency overhead**: ~8 microseconds at 5,000 RPS
- **Self-hosted**: Yes, open-source (Apache 2.0)
- **Models**: 1000+ via 15+ providers
- **Setup**: `docker run -p 8080:8080 maximhq/bifrost` or `npx @maximhq/bifrost`
- **UI**: Built-in web dashboard (config, monitoring, analytics)
- **Auth**: Username/password + virtual keys + SSO
- **Key features**:
  - 50x faster than LiteLLM
  - Zero-config startup (auto-detects env var API keys)
  - Semantic caching
  - MCP Gateway (enterprise)
  - Budget management with virtual keys
  - Prometheus metrics
  - Automatic failover between providers
- **Weaknesses**:
  - Newer, smaller community (~3K GitHub stars)
  - MCP/tools features gated to enterprise
  - Model catalog can lag behind provider releases
  - Virtual key model restrictions are finicky
- **Best for**: Speed-critical applications, teams wanting Go performance with zero config
- **Status on QNAP**: ✅ Running at https://bifrost.disorganized.net

---

### LiteLLM
- **Language**: Python
- **Latency overhead**: ~4ms
- **Self-hosted**: Yes, open-source (MIT)
- **Models**: 100+ models across all major providers
- **Setup**: Docker image + config.yaml + postgres (for full features)
- **UI**: Admin dashboard at /ui (keys, usage, logs, models, teams)
- **Auth**: Master key + virtual keys per user/team
- **Key features**:
  - Most provider support (every model you can think of)
  - Biggest community (40K+ GitHub stars)
  - Virtual keys with per-key budgets and rate limits
  - Cost tracking per key/team/model
  - Fallback chains (if model A fails → try model B)
  - Full admin dashboard
  - Extensive logging
  - Supports tools/function calling passthrough
- **Weaknesses**:
  - Python = slower (GIL limits concurrency)
  - Heavy Docker image (~500MB)
  - Complex configuration for advanced features
  - Database required for key management/UI
- **Best for**: Maximum flexibility, teams needing detailed cost tracking, multi-tenant setups
- **Status on QNAP**: ✅ Running at https://litellm.disorganized.net

---

### OpenRouter
- **Type**: Managed SaaS (not self-hostable)
- **Latency overhead**: 15-30ms (network hop to their servers)
- **Models**: 200+ from all providers
- **Setup**: Sign up, get API key, point your app at their endpoint
- **Auth**: API key (Bearer token)
- **Key features**:
  - Zero infrastructure to manage
  - Auto-failover between providers (invisible to you)
  - Pay-per-token (no monthly commitment)
  - Model discovery (try any model instantly)
  - Usage dashboard
- **Weaknesses**:
  - Markup on token prices (they take a cut)
  - Can't self-host (vendor dependency)
  - Less control over routing logic
  - Your data goes through their servers
- **Best for**: Quick prototyping, trying many models, when you don't want to manage infrastructure
- **Status**: Not deployed (managed service, just sign up if wanted)

---

### Portkey
- **Language**: TypeScript
- **Latency overhead**: ~5ms
- **Self-hosted**: Partial (open-source gateway, managed platform for full features)
- **Models**: 200+
- **Key features**:
  - Enterprise governance (teams, budgets, approvals)
  - Guardrails (content filtering, PII detection)
  - MCP support for tool use
  - Caching (semantic + exact)
  - Detailed analytics dashboard
  - Canary deployments for model rollouts
- **Weaknesses**:
  - Full features require managed platform
  - More complex than needed for personal/small team
  - TypeScript = middle ground performance
- **Best for**: Enterprise teams needing governance, guardrails, and compliance

---

### TensorZero
- **Language**: Rust
- **Latency overhead**: ~0.3ms
- **Self-hosted**: Yes, open-source
- **Models**: Configurable
- **Key features**:
  - **ML-optimized routing** — learns which model performs best for YOUR specific prompts
  - Extremely fast (Rust)
  - A/B testing built in
  - Optimizes for quality AND cost simultaneously
  - Feedback loop: you rate responses, it learns
- **Weaknesses**:
  - Niche — focused on optimization, not general gateway
  - Smaller community
  - Needs training data (cold start problem)
- **Best for**: When you want to auto-discover which model handles each task best. Interesting for FLUX (which model classifies news best vs which strategizes best?)

---

### Helicone
- **Language**: Rust
- **Latency overhead**: ~5ms
- **Self-hosted**: Yes, open-source
- **Key features**:
  - Best-in-class observability dashboard
  - Every request logged with full detail
  - Cost tracking, latency percentiles
  - User-level analytics
  - Beautiful UI
- **Weaknesses**:
  - Routing is basic (not as sophisticated as LiteLLM)
  - Fewer providers supported
  - Primary focus is observability, not routing
- **Best for**: Teams whose main pain is "we can't see what's happening with our LLM calls"

---

### Cloudflare AI Gateway
- **Type**: Managed (part of Cloudflare)
- **Latency**: 10-20ms
- **Models**: ~20 supported
- **Key features**:
  - Free tier
  - Built into Cloudflare (if you already use them)
  - Response caching at the edge
  - Rate limiting
- **Weaknesses**:
  - Limited model support
  - Can't self-host
  - Basic features compared to dedicated gateways
- **Best for**: Cloudflare-native teams wanting basic caching + rate limiting

---

### NotDiamond
- **Type**: Managed smart router
- **Key features**:
  - AI that picks which AI to use per request
  - Optimizes for quality/cost/speed tradeoff
  - Claims to outperform any single model
- **Weaknesses**:
  - Can't self-host
  - Adds latency (needs to "think" about routing)
  - External dependency
- **Best for**: If you trust an AI to pick the best AI for each request

---

### Martian
- **Type**: Managed smart router
- **Similar to NotDiamond** — routes to cheapest model that can handle the task
- **Best for**: Cost optimization across many models

---

### Backboard.io
- **Category**: NOT a gateway — it's portable AI memory
- **What it does**: Lets you carry conversation context/memory across different LLMs
- **Why it matters**: Switch from Claude to GPT without losing context
- **Relevance to us**: OpenClaw's QMD memory system already solves this for our use case

---

### LangChain
- **Category**: NOT a gateway — it's an agent orchestration framework
- **What it does**: Chains, agents, tool use, RAG
- **Competes with**: FLUX's intelligence pipeline, not with Bifrost/LiteLLM
- **My take**: Bloated, over-abstracted. FLUX's custom pipeline is leaner and purpose-built.

---

## Performance Comparison Table

| Gateway | Language | P50 Latency | P95 Latency | Max RPS (single instance) |
|---------|----------|-------------|-------------|--------------------------|
| Bifrost | Go | ~8μs | ~11μs | 5,000+ |
| TensorZero | Rust | ~0.3ms | <1ms | 10,000+ |
| Kong AI | Lua/Go | ~3ms | ~8ms | ~3,000 |
| LiteLLM | Python | ~4ms | ~8ms | ~1,000 |
| Helicone | Rust | ~5ms | ~8ms | ~3,000 |
| Portkey | TypeScript | ~5ms | ~12ms | ~2,000 |
| Cloudflare | Managed | ~10-20ms | ~40ms | N/A |
| OpenRouter | Managed | ~15-30ms | ~50ms | N/A |

---

## Feature Matrix

| Feature | Bifrost | LiteLLM | OpenRouter | Portkey | TensorZero | Helicone |
|---------|---------|---------|------------|---------|-----------|----------|
| Self-host | ✅ | ✅ | ❌ | Partial | ✅ | ✅ |
| OpenAI-compatible API | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Auto failover | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Semantic caching | ✅ | Manual | ❌ | ✅ | ❌ | ❌ |
| Budget controls | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| Virtual keys | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| Admin UI | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Cost tracking | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Guardrails | Enterprise | ❌ | ❌ | ✅ | ❌ | ❌ |
| MCP/Tools | Enterprise | ❌ | ❌ | ✅ | ❌ | ❌ |
| ML routing | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Prometheus metrics | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |

---

## Recommendation for Our Setup

**Keep both Bifrost and LiteLLM running.** They serve different purposes:

- **Bifrost** → FLUX pipeline (speed matters for trading, minimal overhead)
- **LiteLLM** → everything else (OpenClaw agents, Campdesk AI, experiments — cost tracking and virtual keys useful)

If forced to pick one: **LiteLLM** has more features for the slight performance trade-off (which is irrelevant when the LLM call takes 5-30 seconds anyway).

**Future consideration**: Add **TensorZero** to auto-learn which model handles classification vs strategy generation best in the FLUX pipeline. Could save money by routing simple tasks to cheaper models while keeping complex reasoning on opus-4-7.

---

## URLs (Live on QNAP)

- Bifrost: https://bifrost.disorganized.net
- LiteLLM: https://litellm.disorganized.net/ui
- LiteLLM API key: `sk-flux-litellm-2026`
- Bifrost auth: see 1Password "Bifrost" item
