LLMux unifies 20+ free-tier providers — text, image, audio, video — behind a single OpenAI-compatible gateway. Automatic fallback, smart routing, multi-key rotation, zero lock-in.
```python
from openai import OpenAI

# Just change base_url — your existing code works
client = OpenAI(
    api_key="any",
    base_url="http://localhost:3000/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",  # or "gemini-flash", "deepseek-r1", ...
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
```typescript
import OpenAI from "openai";

// Change one line — everything else stays the same
const client = new OpenAI({
  apiKey: "any",
  baseURL: "http://localhost:3000/v1",
});

// "auto" lets LLMux pick the best available provider
const stream = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Write a poem" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
```bash
# Text completion
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-flash","messages":[{"role":"user","content":"Hi!"}]}'

# Image generation
curl http://localhost:3000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt":"a robot in a garden","model":"pollinations-flux"}'

# List all models
curl http://localhost:3000/v1/models
```
LLMux handles the hard parts so you can focus on building.
Drop-in replacement for the OpenAI API. Works with every OpenAI SDK, LangChain, LlamaIndex, and any tool that speaks OpenAI.
3-level fallback: retry the same provider, try another with the same model, then any compatible provider. Your app never goes down.
Add multiple API keys per provider. LLMux round-robins across them and auto-rotates on 429 errors — three keys means triple the effective rate limit.
Tracks RPM, RPD, TPM, and TPD per provider. In-memory by default, Redis-backed for multi-instance deployments.
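Conceptually, per-provider tracking like this is a rolling-window counter. A minimal sketch (illustrative only — class and method names here are hypothetical, not LLMux internals):

```python
import time
from collections import deque

class WindowCounter:
    """Counts events (requests or tokens) in a rolling window, e.g. 60s for RPM."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()  # (timestamp, amount) pairs

    def _prune(self, now):
        # Drop events that have aged out of the window
        while self.events and now - self.events[0][0] >= self.window_s:
            self.events.popleft()

    def try_consume(self, amount=1, now=None):
        now = time.monotonic() if now is None else now
        self._prune(now)
        used = sum(a for _, a in self.events)
        if used + amount > self.limit:
            return False  # provider at its limit; router should pick another
        self.events.append((now, amount))
        return True

# e.g. a 30 RPM free-tier provider
rpm = WindowCounter(limit=30, window_s=60)
```

The same counter works for RPD (86400s window), TPM, and TPD by changing `limit`, `window_s`, and `amount` (tokens instead of requests).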
SSE streaming proxied transparently across all text providers. Your UI gets tokens as fast as the provider sends them.
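For reference, OpenAI-style streams are Server-Sent Events: each frame is a `data: {...}` JSON line, terminated by `data: [DONE]`. LLMux proxies these frames verbatim; a minimal parser sketch, if you ever need to consume them without an SDK:

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

frames = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(frames)))  # → Hello
```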
5 built-in strategies: priority, round-robin, least-busy, latency-based, random-weighted. Switch in one config line.
Text, image (FLUX, Seedream, GPT-Image), audio (TTS + STT), and video — all through the same OpenAI-compatible surface.
Real-time provider health, rate limit states, and request metrics at /dashboard. Know what's healthy before you need it.
Optional shared-secret auth on /v1/* routes. Deploy securely to your team without exposing underlying provider keys.
Other gateways were designed for paid-API workflows. LLMux was built from day one around the constraints of free-tier providers — rate limits, multi-key pools, and zero-downtime fallback.
| Feature | ⚡ LLMux | OpenRouter | LiteLLM | Portkey OSS |
|---|---|---|---|---|
| Per-key rotation on 429 | ✅ Built-in | ❌ | ❌ | ❌ |
| Rate-limit-aware routing | ✅ Full | ❌ | ⚠ Partial | ❌ |
| Multi-key pool per provider | ✅ Unlimited | ❌ | ❌ | ❌ |
| Self-hosted (no cloud dependency) | ✅ | ❌ Cloud-only | ✅ | ⚠ Limited |
| OpenAI-compatible API | ✅ | ✅ | ✅ | ✅ |
| Free-tier providers (20+) | ✅ 20+ | ⚠ Some | ✅ | ✅ |
| Automatic 3-level fallback | ✅ 3 levels | ⚠ Basic | ✅ | ✅ |
| Multiple routing strategies | ✅ 5 built-in | ❌ | ✅ | ⚠ Partial |
| Zero database required | ✅ | N/A | ❌ Needs DB | ✅ |
| TypeScript / Node.js | ✅ | N/A | ❌ Python | ✅ |
| Live operational dashboard | ✅ Built-in | ✅ Hosted | ⚠ Paid tier | ⚠ Basic |
| Text + Image + Audio + Video | ✅ All 4 | ✅ | ✅ | ✅ |
| Cost to self-host on free tiers | $0 | $0 + markup | $0 | $0 |
OpenRouter is a hosted cloud proxy — your traffic goes through their servers and you pay a per-token markup. LLMux runs on your hardware, rotates your own API keys, and costs zero per token on free tiers.
LiteLLM is excellent but Python-only, requires PostgreSQL + Redis for full features, and doesn't do per-key multi-key rotation on 429s. LLMux is TypeScript, runs with zero database, and key rotation is built into every provider.
Portkey's core observability and load balancing features require their hosted cloud service. The OSS version has limited support for multi-key rotation and budget-aware routing. LLMux is 100% self-contained by design.
The built-in dashboard at /dashboard gives you real-time insight into every request, every provider, and every rate limit — no setup required.
Live status dot, RPM usage bar, average latency, and tier — for every configured provider. Spot degraded providers before they affect users.
Last 100 requests with provider, model, status, latency, and token count. Fallback chains visible inline. Updates every 3 seconds.
RPM and RPD usage bars for each provider. Color shifts yellow → red as limits approach. Rotation events logged in real time.
The settings panel at /settings lets you manage providers, routing, and API keys — all from a browser.
See all providers at a glance. Enable/disable, edit API keys and rate limits, change routing tier, add new dynamic providers, and test connectivity — without touching YAML.
Status shows No API Key for providers that are enabled but missing credentials.
Switch the active routing strategy (priority, round-robin, least-busy, latency-based, random-weighted) with a single click. Changes take effect immediately — no restart needed.
Create and revoke named API keys for clients accessing /v1/*. Each key has a label, creation date, and can be revoked independently. Keys are shown once on creation.
Issue a separate gateway key for each app, team, or integration. Revoke one without affecting the rest. The GATEWAY_API_KEY env var remains active alongside UI-managed keys.
Navigate to /settings and click Gateway in the left sidebar.
Enter a label (e.g. production-app, vscode-ext) and click Generate.
The full key is displayed immediately after creation. Copy it now — it cannot be retrieved later.
Pass the key to any OpenAI SDK or HTTP client via Authorization: Bearer llmux-...
```bash
# List all keys
curl http://localhost:3000/api/admin/gateway-keys \
  -H "X-Admin-Token: $ADMIN_TOKEN"

# Create a named key
curl -X POST http://localhost:3000/api/admin/gateway-keys \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}'

# Revoke a key
curl -X DELETE http://localhost:3000/api/admin/gateway-keys/<id> \
  -H "X-Admin-Token: $ADMIN_TOKEN"
```
```python
from openai import OpenAI

client = OpenAI(
    api_key="llmux-xxxxxxxxxxxxxxxxxxxx",
    base_url="http://localhost:3000/v1",
)

# All /v1/* requests now require the gateway key
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
The built-in chat playground at /playground lets you interactively test providers and models without writing any code.
Select a specific provider from the dropdown to route directly to it — bypassing the normal routing strategy. Great for comparing providers side-by-side.
All models for the selected provider are listed in a second dropdown. The actual API model ID is sent to the provider; the friendly alias is shown in the UI. Status bar shows the provider name and model that served the last response.
Toggle streaming on or off. Streaming responses show a blinking cursor while tokens arrive, then render the complete output with a smooth fade-in.
Assistant responses are fully rendered as markdown — headings, bullet lists, numbered lists, fenced code blocks with syntax highlighting, tables, and blockquotes. Code blocks get a one-click copy button on hover.
Every playground conversation is counted in the dashboard stats — Total Requests, Tokens Used, and per-provider breakdowns. Playground requests are first-class citizens.
Full dark/light mode toggle. Preferences are persisted in localStorage and respected across dashboard, settings, and playground pages.
Every provider is configurable in a single YAML file. No code changes needed to add or swap providers.
| Provider | Tier | Top Models | Free Limits |
|---|---|---|---|
| Groq | ⭐ 1 | GPT-OSS 120B, GPT-OSS 20B, Llama 3.3 70B, Llama 4 Scout | 30 RPM / 14.4K RPD |
| Cerebras | ⭐ 1 | Llama 3.3 70B, Qwen3-32B | 30 RPM / 1M TPD |
| Google Gemini | ⭐ 1 | Gemini 3.1 Pro Preview, Gemini 3 Flash, 2.5 Pro (1M–2M ctx) | 15 RPM / 1K RPD |
| Mistral AI | ⭐ 1 | Mistral Large, Small 4 (256K), Magistral Medium, Codestral | ~2 RPM / 500K TPM |
| OpenRouter | ✦ 2 | DeepSeek R1, DeepSeek V3, Qwen3 Coder 480B, Gemini 2.5 Flash | 20 RPM / 50 RPD |
| Cloudflare Workers AI | ✦ 2 | Llama 3.3 70B, Llama 4 Scout, DeepSeek R1 | 10K neurons/day |
| Hugging Face | ✦ 2 | Llama 4 Scout, Llama 3.3 70B, Qwen3-235B | Daily limit |
| SambaNova | ✦ 2 | Llama 4 Maverick, Llama 4 Scout, DeepSeek R1/V3 | 20 RPM / 200K TPD |
| Cohere | ✦ 2 | Command A (256K), Command A Reasoning, Command A Vision | 1K calls/month |
| DeepSeek | ✦ 2 | DeepSeek V3.2, R1 (131K ctx) | 5M free tokens |
| NVIDIA NIM | ✦ 2 | Nemotron 3 Super 120B (1M ctx), Qwen3.5 397B, Mistral Small 4 | Free endpoints |
| GitHub Models | ✦ 2 | GPT-5.4, Claude Sonnet 4.6, Llama 4 Scout, DeepSeek R1 | 15 RPM / 150 RPD |
| Pollinations AI | ✦ 2 | openai (GPT-5 Mini), claude-fast, deepseek, mistral, kimi, glm, minimax, perplexity + 30 more | Free tier + API key |
| xAI Grok | ✦ 2 | Grok-3, Grok-3-Mini, Grok-3-Fast, Grok-3-Mini-Fast (131K ctx) | Paid — credits |
| Moonshot AI (Kimi) | ✦ 2 | Kimi K2.5 (1T MoE, vision), Kimi K2 Thinking | Free credits on signup |
| Zhipu AI (GLM) | ✦ 2 | GLM-5 (744B MoE), GLM-4-Flash | Free tier available |
| Provider | Tier | Models | Notes |
|---|---|---|---|
| Pollinations AI | ⭐ 1 | flux, zimage, klein, gptimage, wan-image, qwen-image, seedream5, gptimage-large, kontext | flux/zimage/klein/wan-image free |
| Cloudflare Workers AI | ⭐ 1 | FLUX.2 Klein, Flux Schnell, DreamShaper-8 | 10K neurons/day |
| Together AI | ✦ 2 | FLUX.1 Schnell Free, FLUX.1 Dev | Free endpoint |
| Hugging Face | ✦ 2 | FLUX.1 Schnell, SDXL | Rate-limited |
| fal.ai | ✦ 2 | FLUX Pro 1.1, Schnell | ~100 free credits |
| Provider | Mode | Models / Voices | Free Limits |
|---|---|---|---|
| Groq Whisper | STT | whisper-large-v3-turbo, distil-whisper-en | 20 RPM / 2K RPD |
| Groq PlayAI | TTS | PlayAI Dialog, Arabic | Rate-limited |
| ElevenLabs | TTS | Flash v2.5, Multilingual v2 | 10K chars/month |
| Deepgram | TTS + STT | Aura-2, Nova-3 | $200 free credit |
| Fish Audio | TTS | Speech 1.6, 1.5 | Daily limit |
| Pollinations AI | TTS + Music | ElevenLabs v3 (30+ voices), ACE-Step music gen | elevenlabs TTS free |
| Pollinations AI | STT | Whisper Large V3, Scribe v2 (90+ langs, diarization) | Free + API key |
| Provider | Models | Notes |
|---|---|---|
| Pollinations AI | ltx-2 (LTX-2.3 free), veo (Google Veo 3.1 Fast), wan (Wan 2.6), wan-fast (Wan 2.2), seedance, seedance-pro, nova-reel | ltx-2 free; others paid pollen credits |
| Replicate | minimax/video-01, tencent/hunyuan-video | Paid credits (free trial available) |
| Hugging Face | Video diffusion models | Rate-limited free tier |
Copy-paste examples for every modality, in Python, TypeScript, and bash.
```python
from openai import OpenAI

client = OpenAI(api_key="any", base_url="http://localhost:3000/v1")

# Streaming response
stream = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "Write a haiku about free APIs"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
```typescript
import OpenAI from "openai";

const client = new OpenAI({ apiKey: "any", baseURL: "http://localhost:3000/v1" });

// "auto" lets LLMux pick the best available provider
const res = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Explain quantum entanglement simply" }],
});
console.log(res.choices[0].message.content);
```
```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1",
    "messages": [{"role":"user","content":"Prove sqrt(2) is irrational"}]
  }'
```
```python
from openai import OpenAI

client = OpenAI(api_key="any", base_url="http://localhost:3000/v1")

img = client.images.generate(
    prompt="a robot painting a masterpiece, oil on canvas, dramatic lighting",
    model="pollinations-zimage",  # or "cf-flux-schnell", "together-flux-schnell"
    size="1024x1024",
)
print(img.data[0].url)
```
```bash
curl http://localhost:3000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cyberpunk city at sunset with neon lights",
    "model": "pollinations-flux",
    "size": "1024x1024",
    "response_format": "url"
  }'
```
```python
from openai import OpenAI

client = OpenAI(api_key="any", base_url="http://localhost:3000/v1")

# Standard TTS (Pollinations, ElevenLabs, Deepgram, Groq PlayAI)
audio = client.audio.speech.create(
    model="tts-1",  # routes to best available TTS provider
    input="Welcome to LLMux, your unified AI gateway.",
    voice="nova",  # alloy, echo, fable, onyx, nova, shimmer + 25 more
    response_format="mp3",
)
audio.stream_to_file("welcome.mp3")

# Music generation via Pollinations elevenmusic
music = client.audio.speech.create(
    model="pollinations-music",
    input="An upbeat electronic track with synth leads",
    voice="alloy",
)
music.stream_to_file("track.mp3")
```
```bash
curl http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello from LLMux!","voice":"nova"}' \
  --output speech.mp3
```
```python
from openai import OpenAI

client = OpenAI(api_key="any", base_url="http://localhost:3000/v1")

with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",  # Groq — fastest free STT
        file=f,
        response_format="json",
    )
print(transcript.text)

# 90+ language support via Pollinations Scribe
with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="pollinations-scribe",
        file=f,
        language="fr",  # ISO-639-1 language code
    )
print(transcript.text)
```
```bash
curl http://localhost:3000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo"
```
```bash
# List all available models with aliases
curl http://localhost:3000/v1/models | jq '.data[].id'

# Per-provider health + rate-limit state
curl http://localhost:3000/health/providers | jq '.'

# Gateway health
curl http://localhost:3000/health
```
Add multiple API keys per provider. LLMux round-robins normally and auto-rotates on 429 errors — fully transparent to your client.
Sign up for 2-3 accounts at Groq, Gemini, or any provider. Each gets the full free quota.
Name them GROQ_API_KEY_1, GROQ_API_KEY_2, etc.
In `providers.yaml`, switch from `api_key` to `api_keys` (plural) and supply an array of keys.
LLMux round-robins across keys, and instantly rotates to the next key on any 429 rate-limit error.
```yaml
# Before: single key
- id: groq-llama3-70b
  api_key: ${GROQ_API_KEY}

# After: 3 keys = 3× the rate limit
- id: groq-llama3-70b
  api_keys:
    - ${GROQ_API_KEY_1}  # 30 RPM
    - ${GROQ_API_KEY_2}  # 30 RPM
    - ${GROQ_API_KEY_3}  # 30 RPM

# Result: 90 effective RPM ↑
```
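The rotation behavior itself can be sketched in a few lines (a hypothetical helper for illustration, not LLMux source — the real gateway does this per HTTP request):

```python
import itertools

class KeyPool:
    """Round-robin over a key pool; move to the next key when one hits a 429."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.cycle = itertools.cycle(self.keys)

    def request(self, send):
        """`send(key)` performs the HTTP call and returns a status code.

        Tries each key at most once per request; rotates past rate-limited keys.
        """
        for _ in range(len(self.keys)):
            key = next(self.cycle)
            status = send(key)
            if status != 429:
                return key, status
        raise RuntimeError("all keys rate-limited; fall back to another provider")

pool = KeyPool(["GROQ_API_KEY_1", "GROQ_API_KEY_2", "GROQ_API_KEY_3"])
```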
Change router.default_strategy in providers.yaml to switch strategies instantly.
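For example (field names as used elsewhere in this README — check your `providers.yaml` for the exact schema):

```yaml
router:
  default_strategy: least-busy  # priority | round-robin | least-busy | latency-based | random-weighted
```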
Routes to Tier 1 first, then Tier 2, etc. Ensures the highest-priority providers are used first.
Cycles through all available providers evenly, distributing load across the entire pool.
Always routes to the provider with the fewest in-flight requests. Minimizes queue depth.
Tracks rolling average latency per provider and routes to the historically fastest one.
Tier-weighted random selection. Spreads load while still preferring higher-tier providers.
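Two of the strategies above, sketched as selection functions (illustrative pseudologic over hypothetical provider records, not LLMux source):

```python
import random

def pick_least_busy(providers):
    """least-busy: the provider with the fewest in-flight requests wins."""
    return min(providers, key=lambda p: p["in_flight"])

def pick_random_weighted(providers, rng=random.random):
    """random-weighted: lower tier number gets a proportionally larger share."""
    weights = [1.0 / p["tier"] for p in providers]
    r = rng() * sum(weights)
    for p, w in zip(providers, weights):
        r -= w
        if r <= 0:
            return p
    return providers[-1]

providers = [
    {"id": "groq", "tier": 1, "in_flight": 4},
    {"id": "cerebras", "tier": 1, "in_flight": 1},
    {"id": "openrouter", "tier": 2, "in_flight": 0},
]
print(pick_least_busy(providers)["id"])  # → openrouter
```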
Every provider gets a tier (1–4). LLMux tries Tier 1 providers first, then falls back down. There's no cost or quality implication — you can run completely free providers at Tier 1.
When a provider fails, LLMux silently tries the next option. Your client sees a successful response — nothing else.
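The 3-level chain described above can be sketched as follows (helper names and record shapes are assumptions for illustration — the real routing lives inside LLMux):

```python
def complete_with_fallback(request, providers, same_provider_retries=1):
    """
    Level 1: retry the same provider.
    Level 2: another provider serving the same model.
    Level 3: any compatible provider.
    """
    same_model = [p for p in providers if request["model"] in p["models"]]
    others = [p for p in providers if p not in same_model]
    errors = []
    for provider in same_model + others:
        for attempt in range(1 + same_provider_retries):  # Level 1 retries
            try:
                return provider["call"](request)  # provider-specific HTTP call
            except RuntimeError as exc:
                errors.append((provider["id"], attempt, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

The client only ever sees the final successful response (or a single error after every option is exhausted).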
LLMux is a standard Node.js app. Deploy it anywhere you can run containers or Node.
The recommended way. Full control, Redis included via Compose.
```bash
docker build -t llmux-gateway .
docker run -p 3000:3000 \
  --env-file .env \
  -v $(pwd)/config:/app/config \
  llmux-gateway
```
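A minimal Compose sketch with the bundled Redis (service names, ports, and env file are assumptions — adapt to the repo's actual `docker-compose.yml`):

```yaml
services:
  llmux:
    build: .
    ports: ["3000:3000"]
    env_file: .env
    volumes:
      - ./config:/app/config
    depends_on: [redis]
  redis:
    image: redis:7-alpine
```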
One-click deploy. Set API keys as env vars, done.
```bash
# railway.toml already configured
railway up
# Set env vars in Railway dashboard
```
Global edge deployment. Sub-20ms latency from any region.
```bash
# vercel.json already configured
vercel deploy --prod
```
Full control. Use with PM2 or systemd for production stability.
```bash
pnpm build
pm2 start dist/index.js \
  --name LLMux-gateway
```
LLMux's plugin architecture makes adding new providers simple. If it has an API, it can be a provider.
```typescript
import { BaseProvider } from "../base.js";

export class MyProvider extends BaseProvider {
  async chatCompletion(req, ctx) {
    const model = this.resolveModel(req.model);
    const res = await this.postJSON(
      `${this.config.baseUrl}/chat/completions`,
      { ...req, model },
      ctx
    );
    if (req.stream) return this.proxyStream(res, ctx);
    const data = await res.json();
    // recordTokens updates rate-limit tracker AND dashboard stats
    await this.recordTokens(ctx, data.usage?.total_tokens ?? 0);
    return new Response(JSON.stringify(data), {
      headers: { "Content-Type": "application/json" },
    });
  }
}
```
```typescript
import { MyProvider } from "./text/myprovider.js";

const PROVIDER_FACTORIES = {
  // existing providers...
  "my-provider-id": MyProvider, // must match id in providers.yaml
};
```
```yaml
- id: my-provider-id
  name: My Provider
  modality: text
  tier: 2
  enabled: true
  requires_auth: true
  api_keys:              # multi-key support built-in
    - ${MY_PROVIDER_KEY_1}
    - ${MY_PROVIDER_KEY_2}
  base_url: https://api.myprovider.com/v1
  adapter: openai
  models:
    - id: my-model-large
      alias: my-large
      context_window: 128000
  limits:
    rpm: 60
    concurrency: 5
  timeout: 30000
  max_retries: 2
```
```bash
MY_PROVIDER_KEY_1=your-key-here
MY_PROVIDER_KEY_2=optional-second-key
```
Restart the gateway (`pnpm dev`) and your new provider is live.
Clone, configure, deploy. Three commands to a production AI gateway that never goes down.
```bash
git clone https://github.com/shaik-shahansha/llmux
cd llmux && pnpm install
cp config/providers.yaml.example config/providers.yaml
# add your API keys...
pnpm dev
```