
Model Configuration

The inference router is the model boundary for AI-in-a-Box. The agent runtime, guardrails, memory, knowledge, Dify seed workflows, and the frontend all route model calls through its OpenAI-compatible API.

Configuration lives in:

  • deploy/.env
  • deploy/config/inference-router/config.yaml
  • the inference-router, agent-runtime, guardrail, memory, and knowledge service sections in deploy/docker-compose.yml
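
To see what is currently configured across these files, a quick grep sketch (illustrative only; assumes a POSIX shell and the repository root as the working directory) is:

# List model-related settings across the three config surfaces.
grep -E '_MODEL=' deploy/.env
grep -nE 'model:|backend:' deploy/config/inference-router/config.yaml
grep -nE '_MODEL' deploy/docker-compose.yml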

Current Defaults

Setting | Default | Purpose
DEFAULT_MODEL | openai/gpt-5.4 | Main agent model for the default no-GPU path.
TITLE_MODEL | openai/gpt-5.4-nano | Short call used for session titles.
LOCAL_MODEL | google/gemma-4-E4B-it | Local model name used by agent-runtime when selecting the local path.
VLLM_MODEL | google/gemma-4-E4B-it | Model loaded by the vLLM container.
CLASSIFIER_MODEL | google/gemma-4-E4B-it | Sensitivity classifier model.
MEMORY_LLM_MODEL | openai/gpt-5.4 in deploy/.env.example; compose fallback is meta-llama/llama-3.3-70b-instruct:free | Memory extraction/consolidation model.
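
As a quick reference, an excerpt of deploy/.env reflecting these defaults might look like the following (illustrative; not the full file):

DEFAULT_MODEL=openai/gpt-5.4
TITLE_MODEL=openai/gpt-5.4-nano
LOCAL_MODEL=google/gemma-4-E4B-it
VLLM_MODEL=google/gemma-4-E4B-it
CLASSIFIER_MODEL=google/gemma-4-E4B-it
MEMORY_LLM_MODEL=openai/gpt-5.4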

Route Configuration

The shipped router config has two backends:

backends:
  - name: vllm
    url: http://vllm:8000
    type: vllm
  - name: openrouter
    url: https://openrouter.ai/api/v1
    type: external
    api_key_env: OPENROUTER_API_KEY
    reasoning_capable: true
    reasoning_with_tools: true

Routes map public model IDs to a backend:

routes:
  - model: google/gemma-4-E4B-it
    backend: vllm
  - model: default
    backend: vllm
  - model: openai/gpt-5.4
    backend: openrouter

Use the backend name vllm, not the older vllm-1.

Use OpenRouter

  1. Set OPENROUTER_API_KEY in deploy/.env.
  2. Set DEFAULT_MODEL to a model ID that is routed to the openrouter backend.
  3. Restart services that cache model config:
docker compose --env-file deploy/.env -f deploy/docker-compose.yml -f deploy/docker-compose.dev.yml restart inference-router agent-runtime knowledge

Restart guardrail after changing GUARDRAIL_LLM_MODEL, and restart memory after changing MEMORY_LLM_MODEL.
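
For example, after steps 1 and 2 the relevant deploy/.env entries might read (the key value below is a placeholder):

OPENROUTER_API_KEY=sk-or-...        # placeholder; substitute your real key
DEFAULT_MODEL=openai/gpt-5.4        # a model ID routed to the openrouter backend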

The router gates reasoning options on backend capability. Usage metadata is requested generically with stream_options.include_usage and include: ["usage"] where the upstream accepts it.
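
As an end-to-end sanity check of the OpenRouter path, a minimal chat-completions request against the router's OpenAI-compatible API might look like this (illustrative; assumes the router is published on localhost:8080, as in the verification commands below):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-5.4", "messages": [{"role": "user", "content": "Say hello."}]}'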

Use Local vLLM

  1. Set VLLM_MODEL.
  2. If the model is gated, set HUGGING_FACE_HUB_TOKEN.
  3. Start the GPU profile:
make up-gpu

The vLLM command enables Gemma tool-call and reasoning parsers for the shipped google/gemma-4-E4B-it default.
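
For example, a GPU-path deploy/.env excerpt might look like this (the token value is a placeholder and is only needed for gated models):

VLLM_MODEL=google/gemma-4-E4B-it
LOCAL_MODEL=google/gemma-4-E4B-it
HUGGING_FACE_HUB_TOKEN=hf_...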

Verify Routing

curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/routes

/v1/routes shows route-to-backend mapping. /v1/models shows the model list surfaced to clients.
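
To print only the model IDs exposed to clients, a small sketch (assumes jq is installed and that /v1/models returns the standard OpenAI list shape) is:

curl -s http://localhost:8080/v1/models | jq -r '.data[].id'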

Dify

Dify has its own provider configuration, but the seeded AI-in-a-Box workflows are intended to use the inference router rather than direct vLLM/OpenRouter endpoints. In Dify, configure an OpenAI-compatible provider at:

http://inference-router:8004/v1

Use the same model IDs configured in deploy/config/inference-router/config.yaml.
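
To confirm the router is reachable from inside the compose network, one option is to run curl from another service's container; the sketch below assumes the Dify API service is named dify-api and that its image ships curl:

docker compose --env-file deploy/.env -f deploy/docker-compose.yml exec dify-api \
  curl -s http://inference-router:8004/v1/models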

Troubleshooting

Symptom | Check
model not found | Model ID must appear in routes. Restart inference-router after editing config.
OpenRouter 401 | OPENROUTER_API_KEY is empty or invalid.
vLLM OOM | Choose a smaller VLLM_MODEL, lower VLLM_GPU_MEM_UTIL, or use OpenRouter.
No usage/cost data | Check OBSERVABILITY_URL, LANGFUSE_*, and inference-router logs.
Dify calls bypass observability | Point Dify at http://inference-router:8004/v1, not directly at vLLM or OpenRouter.
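
When triaging any of the symptoms above, tailing the router logs while reproducing the failing request is usually the fastest check:

docker compose --env-file deploy/.env -f deploy/docker-compose.yml logs -f --tail=100 inference-router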