Model configuration
The inference router is the only service that receives OpenAI-compatible model requests from platform services. Agent runtime, guardrails, memory, knowledge, and the frontend route model calls through it. Routes resolve to one of three backends: a primary local vLLM, an optional vision-only vLLM, and OpenRouter.
Model selection happens at runtime and is admin-configurable through the
inference-router's model_roles mapping (role → model). Callers never name a concrete model; they send a
role alias (default, title, classifier, memory, profile_curation,
vision, guardrail, knowledge) and the router resolves it. The static
profile-TOML model-name knobs (default_model, title_model,
classifier_model, memory_llm_model, profile_curation_model) were removed
from the schema. What remains in the profile are deploy facts: which model the
bundled vLLM serves (vllm_model) and the vision model (vision_model).
Configuration lives in these places:
| Layer | File | What it sets |
|---|---|---|
| Profile | config/profiles/<name>.toml | Deploy facts: vllm_model (which model the bundled vLLM serves) and vision_model. |
| Role assignments | inference.model_roles via /v1/admin/inference/roles | Runtime role → model mapping; the only model-selection surface. |
| Postgres-backed provider catalog | inference.providers, inference.models | Runtime providers, encrypted API keys, enabled models, model capability flags. |
| Gateway admin proxy | /v1/admin/inference/providers* | Operator CRUD for providers and model exposure; rewrites to inference-router /v1/internal/*. |
| Seed YAML | deploy/config/inference-router/config.yaml | Initial providers + routes only when the catalog is empty. |
| Compose env | deploy/docker-compose.yml + deploy/.compose.env | Per-service deploy env (VLLM_MODEL, LOCAL_MODEL, VISION_MODEL). |
| Secrets | deploy/secrets/openrouter_api_key, deploy/secrets/hugging_face_hub_token | API credentials. |
deploy/.compose.env is rendered from the active profile by
scripts/render-compose-env.py; do not edit it by hand.
The inference router's current runtime source of truth is the
Postgres-backed provider catalog. On startup, seed.FromYAML reads
deploy/config/inference-router/config.yaml only if inference.providers is
empty; once any provider exists, the YAML seed is skipped. The in-memory
registry reloads from inference.providers / inference.models every 30
seconds and immediately after admin provider/model changes.
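The seed-once and reload behaviour described above can be sketched as follows. This is an illustrative model only, not the router's code: `seed_if_empty`, `should_reload`, and the dict-shaped catalog are hypothetical names chosen for the example, and the mapping of seed entries to provider and model rows is an assumption.

```python
RELOAD_INTERVAL_SECONDS = 30  # periodic reload interval stated in the text


def seed_if_empty(catalog: dict, seed_providers: list, seed_models: list) -> bool:
    """Copy the YAML seed into the catalog only when no provider exists yet.

    Returns True if seeding happened, False if the seed was skipped.
    """
    if catalog["providers"]:
        # Any existing provider means the persisted catalog wins.
        return False
    catalog["providers"] = list(seed_providers)
    catalog["models"] = list(seed_models)
    return True


def should_reload(now: float, last_reload: float, admin_changed: bool) -> bool:
    """Reload immediately after an admin change, otherwise every 30 seconds."""
    return admin_changed or (now - last_reload) >= RELOAD_INTERVAL_SECONDS
```

In this sketch a second call to `seed_if_empty` is a no-op, which mirrors why later YAML edits do not replace persisted rows.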
This page documents the admin Inference workspace at /admin/inference —
the consolidated single-page workspace that replaced the former "Models" and
"Inference providers" pages, combining upstream provider management with the
per-role model-assignment table.
Routing topology
The router enforces llm.vision_model for image-bearing requests: if a route
without supports_vision: true receives image content, the router rewrites
the body's model field to vision_model before backend resolution. If
vision_model is empty, the router returns HTTP 400 instead; there is no silent fallback.
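A minimal sketch of that rewrite rule, assuming image content arrives as OpenAI-style `image_url` content parts inside `messages`. The function and exception names are hypothetical and stand in for whatever the router actually uses.

```python
from copy import deepcopy


class VisionNotConfigured(Exception):
    """Stands in for the router's HTTP 400 response in this sketch."""
    status = 400


def has_image_content(body: dict) -> bool:
    """True if any chat message carries an image_url content part."""
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def apply_vision_rewrite(body: dict, route_supports_vision: bool, vision_model: str) -> dict:
    """Return the body to forward, rewriting `model` when the route cannot take images."""
    if route_supports_vision or not has_image_content(body):
        return body  # nothing to rewrite
    if not vision_model:
        # Empty vision_model: fail loudly rather than strip or downgrade.
        raise VisionNotConfigured("vision_model not configured")
    rewritten = deepcopy(body)  # leave the caller's body untouched
    rewritten["model"] = vision_model
    return rewritten
```

The rewrite happens before backend resolution, so the rewritten `model` value is what selects the backend.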
Model roles
LLM consumers (agent-runtime, guardrail, memory, knowledge, the title and
classifier helpers, profile curation) never name a concrete model. They send a
role alias and the inference-router resolves it through model_roles. The
canonical aliases are:
| Role | Sent by | Purpose |
|---|---|---|
| default | agent-runtime, knowledge | Main agent model. |
| title | agent-runtime | Session title generation. |
| classifier | agent-runtime | Sensitivity classifier. |
| memory | memory | Memory fact extraction. |
| profile_curation | agent-runtime | Background profile curation. |
| vision | inference-router (image rewrite) | Image-bearing requests. |
| guardrail | guardrail | Constitutional AI judge. |
| knowledge | knowledge | Knowledge-side generation. |
Assign roles at runtime through the gateway admin proxy. Roles and providers
are also managed in the Inference admin workspace UI at /admin/inference,
not only via curl:
# List the current role → model mapping.
curl http://localhost:8080/v1/admin/inference/roles \
-H "Authorization: Bearer $TOKEN"
# Point the default role at an OpenRouter model.
curl -X PUT http://localhost:8080/v1/admin/inference/roles/default \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"model_id":"openai/gpt-5.5"}'
On first boot the router seeds every role to a model whose backend is actually
reachable. It prefers the bundled local vLLM model and waits through vLLM's
warm-up (the model load can take minutes) before seeding, so a slow start
does not trip a premature fallback. The result is fail-safe, local by default, and free. If local
inference is not deployed on this box (the vLLM host does not resolve, for example on a
CPU / dev-lite stack or an external-only deploy), it falls back to the first
enabled model from an external provider that has an API key configured. If
neither is available, roles start empty for the admin to configure. The seed is
idempotent: once any role exists it is never re-seeded, so later admin changes
are never clobbered. For dev stacks, make seed-dev-roles seeds the roles plus
a local vLLM provider so aliases resolve out of the box.
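The first-boot seeding order above can be summarised in a short sketch. The names (`pick_seed_model`, `seed_roles`, the `provider_has_key` field) are hypothetical, and the vLLM warm-up wait is reduced to a boolean `local_reachable` flag for brevity.

```python
ROLES = (
    "default", "title", "classifier", "memory",
    "profile_curation", "vision", "guardrail", "knowledge",
)


def pick_seed_model(local_model: str, local_reachable: bool, external_models: list):
    """Prefer the local vLLM model, then the first enabled external model with a key."""
    if local_model and local_reachable:
        return local_model
    for model in external_models:
        if model["enabled"] and model["provider_has_key"]:
            return model["id"]
    return None  # nothing reachable: roles start empty


def seed_roles(existing_roles: dict, model) -> dict:
    """Seed every role once; never overwrite an existing assignment."""
    if existing_roles:
        return existing_roles  # idempotent: admin changes are kept
    if model is None:
        return {}
    return {role: model for role in ROLES}
```

Because `seed_roles` returns the existing mapping untouched, running it again after an admin edit changes nothing.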
There are no silent fallbacks. A request for a role with no assignment returns
HTTP 503 (role "<name>" is not configured); a role mapped to a model with no
enabled backend returns HTTP 503 (model for role "<name>" has no enabled backend). Auxiliary callers (title, classifier, memory, profile curation) log
loudly when a role is unconfigured.
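The two failure modes can be sketched as a small resolver. The error strings are taken from the text above; `RouterError`, `resolve_role`, and the set of enabled models are hypothetical names for illustration, not the router's actual types.

```python
class RouterError(Exception):
    """Carries an HTTP status and message, standing in for the router's response."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status
        self.message = message


def resolve_role(role: str, model_roles: dict, enabled_models: set) -> str:
    """Resolve a role alias to a concrete model id, failing loudly on gaps."""
    model = model_roles.get(role)
    if not model:
        raise RouterError(503, f'role "{role}" is not configured')
    if model not in enabled_models:
        raise RouterError(503, f'model for role "{role}" has no enabled backend')
    return model
```

There is deliberately no default branch: an unassigned role never resolves to some other model.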
Deploy facts (compose env)
Which model the bundled vLLM serves and the vision model are deploy facts,
projected into compose env vars from the active profile. Verified against
deploy/docker-compose.yml:
| Env var | Default | Where it ships | Purpose |
|---|---|---|---|
| VLLM_MODEL | google/gemma-4-E4B-it | vllm container | HF model id loaded into vLLM. |
| LOCAL_MODEL | google/gemma-4-E4B-it | inference-router, agent-runtime | Local model route name (templated from VLLM_MODEL). |
| VISION_MODEL | empty | inference-router | Vision route name; empty disables vision. |
The Helm chart (helm/aibox/values.yaml) ships an older vllm.model default
of "Qwen/Qwen3.5-2B" — operators overriding via Helm should align this with
the route names in deploy/config/inference-router/config.yaml.
Runtime catalog
The supported operator path is the gateway admin proxy:
# List providers
curl http://localhost:8080/v1/admin/inference/providers \
-H "Authorization: Bearer $TOKEN"
# Add a provider. The router normalizes base_url (trims whitespace and
# trailing slashes, defaults a missing scheme to https, strips pasted endpoint
# suffixes like /chat/completions, /responses, /models) and probes
# <base_url>/models before persisting it. Upstream redirects are never
# followed: a redirecting base_url (e.g. http:// on an https-only host) is
# rejected at creation with the redirect target, because a followed redirect
# would silently turn generation POSTs into body-less GETs.
curl -X POST http://localhost:8080/v1/admin/inference/providers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "openrouter-prod",
"kind": "chat_completions",
"base_url": "https://openrouter.ai/api/v1",
"api_key": "sk-..."
}'
# Pick which upstream models are exposed.
curl -X PATCH http://localhost:8080/v1/admin/inference/providers/<provider-id>/models \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"enabled_model_ids":["openai/gpt-5.5","anthropic/claude-sonnet-4.6"]}'
API keys are envelope-encrypted with INFERENCE_MASTER_KEY_FILE before they
are stored. GET responses return api_key_set, never plaintext keys.
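The base_url normalization described in the provider-creation comment above might look roughly like this. It is a sketch under stated assumptions: the function name is hypothetical, only the three suffixes named in the text are handled, and the `<base_url>/models` probe and redirect rejection are not modelled.

```python
ENDPOINT_SUFFIXES = ("/chat/completions", "/responses", "/models")


def normalize_base_url(raw: str) -> str:
    """Trim, default the scheme to https, and strip pasted endpoint suffixes."""
    url = raw.strip().rstrip("/")
    if "://" not in url:
        url = "https://" + url  # a missing scheme defaults to https
    stripped = True
    while stripped:
        stripped = False
        for suffix in ENDPOINT_SUFFIXES:
            if url.endswith(suffix):
                # Drop the pasted endpoint and any slash left behind.
                url = url[: -len(suffix)].rstrip("/")
                stripped = True
    return url
```

For example, a pasted `openrouter.ai/api/v1/chat/completions/` comes out as `https://openrouter.ai/api/v1` in this sketch.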
Seed route reference
deploy/config/inference-router/config.yaml seeds an empty catalog with these
defaults. After the first successful seed, edit providers and enabled models
through /v1/admin/inference/providers* instead of expecting YAML edits to
replace the persisted catalog.
backends:
  - name: vllm
    url: http://vllm:8000
    type: vllm
  - name: vllm-vision
    url: http://vllm-vision:8000
    type: vllm
  - name: openrouter
    url: https://openrouter.ai/api/v1
    type: external
    api_key_env: OPENROUTER_API_KEY
    reasoning_capable: true
    reasoning_with_tools: true
routes:
  - model: ${LOCAL_MODEL:-google/gemma-4-E4B-it} # backend: vllm
  # No baked `default` route: "default" is a runtime role (see model_roles),
  # not a config-file model, so an unconfigured default fails loud, not silent.
  - model: qwen3-vl # backend: vllm-vision, supports_vision
  - model: anthropic/claude-opus-4.7 # backend: openrouter, supports_vision
  - model: anthropic/claude-sonnet-4.6 # backend: openrouter, supports_vision
  - model: openai/gpt-5.5 # backend: openrouter, supports_vision
  - model: xiaomi/mimo-v2.5-pro # backend: openrouter (cheap aux/helper)
  - model: google/gemini-3.1-pro-preview # backend: openrouter, supports_vision
  - model: z-ai/glm-5.1 # backend: openrouter
  - model: moonshotai/kimi-k2.6 # backend: openrouter
  - model: deepseek/deepseek-v3.2 # backend: openrouter
  - model: minimax/minimax-m2.7 # backend: openrouter
The seed local route name is templated from LOCAL_MODEL so a fresh catalog
starts aligned with the model vLLM is serving. Persisted catalog rows can
then be managed independently through the admin API.
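To illustrate how a `${LOCAL_MODEL:-google/gemma-4-E4B-it}` placeholder resolves, here is a small sketch of shell-style default expansion. It assumes the seed file follows the usual `${NAME:-default}` semantics (default used when the variable is unset or empty); the `expand` helper is hypothetical and not the router's loader.

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(text: str, env: dict) -> str:
    """Substitute placeholders from env, falling back to the inline default."""

    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        if value:
            return value
        return default if default is not None else ""

    return _PLACEHOLDER.sub(replace, text)
```

With `LOCAL_MODEL` unset, the route name in this sketch falls back to the shipped default; with it set, the route follows whatever model vLLM was told to serve.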
Use OpenRouter
- Write the key to deploy/secrets/openrouter_api_key (operator-supplied, empty_ok = true in the manifest).
- Add or update the provider through /v1/admin/inference/providers*, then enable the OpenRouter model ids you want exposed. The router reloads the persisted catalog automatically; no restart is required for provider rows.
- Whether inference may leave the box is governed by the network egress gateway's allowlist, not a config toggle. Add the OpenRouter host through /v1/admin/egress/allowlist (Squid external_acl, hot-reloaded, no restart) so the external provider is reachable.
- Point the roles you want served by OpenRouter at the enabled model ids:
  curl -X PUT http://localhost:8080/v1/admin/inference/roles/guardrail \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_id":"openai/gpt-5.5"}'
Role assignments and provider/allowlist changes all take effect at runtime, with no service restart. Restart services only if you changed env-backed deploy facts such as
VLLM_MODEL or VISION_MODEL.
Reasoning options are gated on backend capability (reasoning_capable,
reasoning_with_tools). Usage metadata is requested generically via
stream_options.include_usage and include: ["usage"] where the upstream
honours it.
Use local vLLM
- Register only the local vLLM provider via the admin API, and keep external providers (OpenRouter, etc.) off the network egress allowlist. There is no routing toggle: a box is "local" precisely when no external provider is reachable (enforced + audited at the egress gateway).
- If the model is gated, write a token to deploy/secrets/hugging_face_hub_token.
- Start the GPU profile:
  make up GPU=single
The vLLM command in deploy/docker-compose.yml enables the gemma4 tool
parser and gemma4 reasoning parser plus
tool_chat_template_gemma4.jinja for the shipped
google/gemma-4-E4B-it default. Multi-vLLM (Gemma + Qwen behind OpenResty)
is available via make up GPU=multi.
Vision models
The single switch is [llm].vision_model in the active profile, projected
into VISION_MODEL for the router.
Path A — local Qwen3-VL (GPU box)
The vllm-vision container serves Qwen3-VL under the route name qwen3-vl
on its own compose profile (gpu-vision).
- Active profile sets [llm].vision_model = "qwen3-vl" (single-tenant defaults to this).
- Boot:
  make use-env PROFILE=single-tenant
  make up GPU=vision
- Verify:
  curl http://localhost:8004/v1/models | jq '.data[] | select(.id == "qwen3-vl")'
  curl http://localhost:8005/health # raw vLLM, dev profile only
Default is Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 (~30B / ~3B active MoE, FP8).
Fits one 80 GB GPU at --gpu-memory-utilization 0.70.
| Variable | Default | Description |
|---|---|---|
| VLLM_VISION_MODEL | Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 | HF repo id. |
| VLLM_VISION_GPU_MEM_UTIL | 0.70 | Passed to vLLM. |
| VLLM_VISION_TP_SIZE | 1 | Tensor-parallel size. |
| VLLM_VISION_MAX_MODEL_LEN | 131072 | Max context tokens. |
| VLLM_VISION_GPU_COUNT | 1 | NVIDIA GPUs reserved. |
| VLLM_VISION_IMAGE | vllm/vllm-openai:v0.11.0 | Must be ≥ 0.11.0 for Qwen3-VL. |
The HF token comes from deploy/secrets/hugging_face_hub_token. For the
235B-A22B variant, set VLLM_VISION_TP_SIZE=8, VLLM_VISION_GPU_COUNT=8,
and provision 8 × 80 GB.
Path B — external vision route (CPU-only)
Set [llm].vision_model to an OpenRouter route marked supports_vision
(anthropic/claude-sonnet-4.6, openai/gpt-5.5,
google/gemini-3.1-pro-preview). The single-tenant profile defaults
vision_model to the local qwen3-vl route instead.
Path C — vision disabled (empty vision_model)
Image-bearing requests return HTTP 400:
{"error":"vision_model not configured: image-bearing requests require [llm] vision_model in inference-router config"}
This is deliberate — pre-1.0 the platform does not silently strip image content or downgrade.
Verify routing
curl http://localhost:8080/v1/models # client-facing model list
curl http://localhost:8080/v1/routes # route → backend mapping
Both go through the gateway. The router itself listens on
127.0.0.1:8004 only in dev (deploy/docker-compose.dev.yml).
Smoke test (vision)
AIBOX_VISION_SMOKE=1 python -m unittest tests/smoke/test_vision_smoke.py -v
Sends a 1×1 red PNG and asserts the answer mentions red. The test is skipped unless the env var is set, so the smoke test does not run in default CI jobs.
Troubleshooting
| Symptom | Check |
|---|---|
| role "<name>" is not configured (503) | The role has no assignment. Set it via PUT /v1/admin/inference/roles/<name> (or run make seed-dev-roles in dev). |
| model for role "<name>" has no enabled backend (503) | The role points at a model with no enabled provider/route. Enable the model via /v1/admin/inference/providers*. |
| model not found | Model ID must appear in routes / be enabled in the catalog. |
| OpenRouter 401 | OPENROUTER_API_KEY empty or invalid; the secret is operator-supplied. |
| upstream redirected … set the provider base_url to the final URL | The stored base_url redirects (e.g. http:// on an https-only host). Re-add the provider with the final URL, and rotate the key, which already traveled over the pre-redirect URL. |
| vLLM OOM | Pick a smaller VLLM_MODEL, lower VLLM_GPU_MEM_UTIL, or fall back to OpenRouter. |
| vision_model not configured 400 | Image content sent while [llm].vision_model is empty. Set it or remove the image. |
| No usage/cost data | OBSERVABILITY_URL, OTEL_EXPORTER_OTLP_ENDPOINT, service auth, and inference-router logs. |
Verified against commit f862a4f8 (2026-06-16) · sources 1a20e45de3f5.