Skip to main content

Model configuration

The inference router is the only service that receives OpenAI-compatible model requests from platform services. Agent runtime, guardrails, memory, knowledge, and the frontend route model calls through it. Routes resolve to one of three backends: a primary local vLLM, an optional vision-only vLLM, and OpenRouter.

Model selection is runtime, admin-configurable via the inference-router's model_roles (role → model). Callers never name a concrete model; they send a role alias (default, title, classifier, memory, profile_curation, vision, guardrail, knowledge) and the router resolves it. The static profile-TOML model-name knobs (default_model, title_model, classifier_model, memory_llm_model, profile_curation_model) were removed from the schema. What remains in the profile are deploy facts: which model the bundled vLLM serves (vllm_model) and the vision model (vision_model).

Configuration lives in these places:

LayerFileWhat it sets
Profileconfig/profiles/<name>.tomlDeploy facts: vllm_model (which model the bundled vLLM serves) and vision_model.
Role assignmentsinference.model_roles via /v1/admin/inference/rolesRuntime role → model mapping; the only model-selection surface.
Postgres-backed provider cataloginference.providers, inference.modelsRuntime providers, encrypted API keys, enabled models, model capability flags.
Gateway admin proxy/v1/admin/inference/providers*Operator CRUD for providers and model exposure; rewrites to inference-router /v1/internal/*.
Seed YAMLdeploy/config/inference-router/config.yamlInitial providers + routes only when the catalog is empty.
Compose envdeploy/docker-compose.yml + deploy/.compose.envPer-service deploy env (VLLM_MODEL, LOCAL_MODEL, VISION_MODEL).
Secretsdeploy/secrets/openrouter_api_key, deploy/secrets/hugging_face_hub_tokenAPI credentials.

deploy/.compose.env is rendered from the active profile by scripts/render-compose-env.py; do not edit it by hand.

The inference router's current runtime source of truth is the Postgres-backed provider catalog. On startup, seed.FromYAML reads deploy/config/inference-router/config.yaml only if inference.providers is empty; once any provider exists, the YAML seed is skipped. The in-memory registry reloads from inference.providers / inference.models every 30 seconds and immediately after admin provider/model changes.

This page documents the admin Inference workspace at /admin/inference — the consolidated single-page workspace that replaced the former "Models" and "Inference providers" pages, combining upstream provider management with the per-role model-assignment table.

Routing topology

The router enforces llm.vision_model for image-bearing requests: if a route without supports_vision: true receives image content, the router rewrites the body's model field to vision_model before backend resolution. Empty vision_model → HTTP 400, no silent fallback.

Model roles

LLM consumers (agent-runtime, guardrail, memory, knowledge, the title and classifier helpers, profile curation) never name a concrete model. They send a role alias and the inference-router resolves it through model_roles. The canonical aliases are:

RoleSent byPurpose
defaultagent-runtime, knowledgeMain agent model.
titleagent-runtimeSession title generation.
classifieragent-runtimeSensitivity classifier.
memorymemoryMemory fact extraction.
profile_curationagent-runtimeBackground profile curation.
visioninference-router (image rewrite)Image-bearing requests.
guardrailguardrailConstitutional AI judge.
knowledgeknowledgeKnowledge-side generation.

Assign roles at runtime through the gateway admin proxy. Roles and providers are also managed in the Inference admin workspace UI at /admin/inference, not only via curl:

# List the current role → model mapping.
curl http://localhost:8080/v1/admin/inference/roles \
-H "Authorization: Bearer $TOKEN"

# Point the default role at an OpenRouter model.
curl -X PUT http://localhost:8080/v1/admin/inference/roles/default \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"model_id":"openai/gpt-5.5"}'

On first boot the router seeds every role to a model whose backend is actually reachable. It prefers the bundled local vLLM model, and waits through vLLM's warm-up (the model load can take minutes) before seeding, so a slow start doesn't trip a premature fallback — fail-safe, local-by-default, free. If local inference is not deployed on this box (the vLLM host doesn't resolve — e.g. a CPU / dev-lite stack or an external-only deploy), it falls back to the first enabled model from an external provider that has an API key configured. If neither is available, roles start empty for the admin to configure. The seed is idempotent: once any role exists it is never re-seeded, so later admin changes are never clobbered. For dev stacks, make seed-dev-roles seeds the roles plus a local vLLM provider so aliases resolve out of the box.

There are no silent fallbacks. A request for a role with no assignment returns HTTP 503 (role "<name>" is not configured); a role mapped to a model with no enabled backend returns HTTP 503 (model for role "<name>" has no enabled backend). Auxiliary callers (title, classifier, memory, profile curation) log loudly when a role is unconfigured.

Deploy facts (compose env)

Which model the bundled vLLM serves and the vision model are deploy facts, projected into compose env vars from the active profile. Verified against deploy/docker-compose.yml:

Env varDefaultWhere it shipsPurpose
VLLM_MODELgoogle/gemma-4-E4B-itvllm containerHF model id loaded into vLLM.
LOCAL_MODELgoogle/gemma-4-E4B-itinference-router, agent-runtimeLocal model route name (templated from VLLM_MODEL).
VISION_MODELemptyinference-routerVision route name; empty disables vision.

The Helm chart (helm/aibox/values.yaml) ships an older vllm.model default of "Qwen/Qwen3.5-2B" — operators overriding via Helm should align this with the route names in deploy/config/inference-router/config.yaml.

Runtime catalog

The supported operator path is the gateway admin proxy:

# List providers
curl http://localhost:8080/v1/admin/inference/providers \
-H "Authorization: Bearer $TOKEN"

# Add a provider. The router normalizes base_url (trims whitespace and
# trailing slashes, defaults a missing scheme to https, strips pasted endpoint
# suffixes like /chat/completions, /responses, /models) and probes
# <base_url>/models before persisting it. Upstream redirects are never
# followed: a redirecting base_url (e.g. http:// on an https-only host) is
# rejected at creation with the redirect target, because a followed redirect
# would silently turn generation POSTs into body-less GETs.
curl -X POST http://localhost:8080/v1/admin/inference/providers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "openrouter-prod",
"kind": "chat_completions",
"base_url": "https://openrouter.ai/api/v1",
"api_key": "sk-..."
}'

# Pick which upstream models are exposed.
curl -X PATCH http://localhost:8080/v1/admin/inference/providers/<provider-id>/models \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"enabled_model_ids":["openai/gpt-5.5","anthropic/claude-sonnet-4.6"]}'

API keys are envelope-encrypted with INFERENCE_MASTER_KEY_FILE before they are stored. GET responses return api_key_set, never plaintext keys.

Seed route reference

deploy/config/inference-router/config.yaml seeds an empty catalog with these defaults. After the first successful seed, edit providers and enabled models through /v1/admin/inference/providers* instead of expecting YAML edits to replace the persisted catalog.

backends:
- name: vllm
url: http://vllm:8000
type: vllm
- name: vllm-vision
url: http://vllm-vision:8000
type: vllm
- name: openrouter
url: https://openrouter.ai/api/v1
type: external
api_key_env: OPENROUTER_API_KEY
reasoning_capable: true
reasoning_with_tools: true

routes:
- model: ${LOCAL_MODEL:-google/gemma-4-E4B-it} # backend: vllm
# No baked `default` route: "default" is a runtime role (see model_roles),
# not a config-file model, so an unconfigured default fails loud, not silent.
- model: qwen3-vl # backend: vllm-vision, supports_vision
- model: anthropic/claude-opus-4.7 # backend: openrouter, supports_vision
- model: anthropic/claude-sonnet-4.6 # backend: openrouter, supports_vision
- model: openai/gpt-5.5 # backend: openrouter, supports_vision
- model: xiaomi/mimo-v2.5-pro # backend: openrouter (cheap aux/helper)
- model: google/gemini-3.1-pro-preview # backend: openrouter, supports_vision
- model: z-ai/glm-5.1 # backend: openrouter
- model: moonshotai/kimi-k2.6 # backend: openrouter
- model: deepseek/deepseek-v3.2 # backend: openrouter
- model: minimax/minimax-m2.7 # backend: openrouter

The seed local route name is templated from LOCAL_MODEL so a fresh catalog starts aligned with the model vLLM is serving. Persisted catalog rows can then be managed independently through the admin API.

Use OpenRouter

  1. Write the key to deploy/secrets/openrouter_api_key (operator-supplied, empty_ok = true in the manifest).

  2. Add or update the provider through /v1/admin/inference/providers*, then enable the OpenRouter model ids you want exposed. The router reloads the persisted catalog automatically; no restart is required for provider rows.

  3. Whether inference may leave the box is governed by the network egress gateway's allowlist, not a config toggle. Add the OpenRouter host through /v1/admin/egress/allowlist (Squid external_acl, hot-reloaded, no restart) so the external provider is reachable.

  4. Point the roles you want served by OpenRouter at the enabled model ids:

    curl -X PUT http://localhost:8080/v1/admin/inference/roles/guardrail \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"model_id":"openai/gpt-5.5"}'

    Role assignments and provider/allowlist changes all take effect at runtime — no service restart. Restart services only if you changed env-backed deploy facts such as VLLM_MODEL or VISION_MODEL.

Reasoning options are gated on backend capability (reasoning_capable, reasoning_with_tools). Usage metadata is requested generically via stream_options.include_usage and include: ["usage"] where the upstream honours it.

Use local vLLM

  1. Register only the local vLLM provider via the admin API, and keep external providers (OpenRouter, etc.) off the network egress allowlist. There is no routing toggle — a box is "local" precisely when no external provider is reachable (enforced + audited at the egress gateway).

  2. If the model is gated, write a token to deploy/secrets/hugging_face_hub_token.

  3. Start the GPU profile:

    make up GPU=single

The vLLM command in deploy/docker-compose.yml enables the gemma4 tool parser and gemma4 reasoning parser plus tool_chat_template_gemma4.jinja for the shipped google/gemma-4-E4B-it default. Multi-vLLM (Gemma + Qwen behind OpenResty) is available via make up GPU=multi.

Vision models

The single switch is [llm].vision_model in the active profile, projected into VISION_MODEL for the router.

Path A — local Qwen3-VL (GPU box)

The vllm-vision container serves Qwen3-VL under the route name qwen3-vl on its own compose profile (gpu-vision).

  1. Active profile sets [llm].vision_model = "qwen3-vl" (single-tenant defaults to this).

  2. Boot:

    make use-env PROFILE=single-tenant
    make up GPU=vision
  3. Verify:

    curl http://localhost:8004/v1/models | jq '.data[] | select(.id == "qwen3-vl")'
    curl http://localhost:8005/health # raw vLLM, dev profile only

Default is Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 (~30B / ~3B active MoE, FP8). Fits one 80 GB GPU at --gpu-memory-utilization 0.70.

VariableDefaultDescription
VLLM_VISION_MODELQwen/Qwen3-VL-30B-A3B-Instruct-FP8HF repo id.
VLLM_VISION_GPU_MEM_UTIL0.70Passed to vLLM.
VLLM_VISION_TP_SIZE1Tensor-parallel size.
VLLM_VISION_MAX_MODEL_LEN131072Max context tokens.
VLLM_VISION_GPU_COUNT1NVIDIA GPUs reserved.
VLLM_VISION_IMAGEvllm/vllm-openai:v0.11.0Must be ≥ 0.11.0 for Qwen3-VL.

The HF token comes from deploy/secrets/hugging_face_hub_token. For the 235B-A22B variant, set VLLM_VISION_TP_SIZE=8, VLLM_VISION_GPU_COUNT=8, and provision 8 × 80 GB.

Path B — external vision route (CPU-only)

Set [llm].vision_model to an OpenRouter route marked supports_vision (anthropic/claude-sonnet-4.6, openai/gpt-5.5, google/gemini-3.1-pro-preview). The single-tenant profile defaults vision_model to the local qwen3-vl route instead.

Path C — vision disabled (empty vision_model)

Image-bearing requests return HTTP 400:

{"error":"vision_model not configured: image-bearing requests require [llm] vision_model in inference-router config"}

This is deliberate — pre-1.0 the platform does not silently strip image content or downgrade.

Verify routing

curl http://localhost:8080/v1/models # client-facing model list
curl http://localhost:8080/v1/routes # route → backend mapping

Both go through the gateway. The router itself listens on 127.0.0.1:8004 only in dev (deploy/docker-compose.dev.yml).

Smoke test (vision)

AIBOX_VISION_SMOKE=1 python -m unittest tests/smoke/test_vision_smoke.py -v

Sends a 1×1 red PNG and asserts the answer mentions red. The test is skipped unless the env var is set, so the smoke test does not run in default CI jobs.

Troubleshooting

SymptomCheck
role "<name>" is not configured (503)The role has no assignment. Set it via PUT /v1/admin/inference/roles/<name> (or run make seed-dev-roles in dev).
model for role "<name>" has no enabled backend (503)The role points at a model with no enabled provider/route. Enable the model via /v1/admin/inference/providers*.
model not foundModel ID must appear in routes / be enabled in the catalog.
OpenRouter 401OPENROUTER_API_KEY empty or invalid; the secret is operator-supplied.
upstream redirected … set the provider base_url to the final URLThe stored base_url redirects (e.g. http:// on an https-only host). Re-add the provider with the final URL — and rotate the key, which already traveled over the pre-redirect URL.
vLLM OOMPick a smaller VLLM_MODEL, lower VLLM_GPU_MEM_UTIL, or fall back to OpenRouter.
vision_model not configured 400Image content sent while [llm].vision_model is empty. Set it or remove the image.
No usage/cost dataOBSERVABILITY_URL, OTEL_EXPORTER_OTLP_ENDPOINT, service auth, and inference-router logs.

Verified against commit f862a4f8 (2026-06-16) · sources 1a20e45de3f5.