# Model Configuration
The inference router is the model boundary for AI-in-a-Box. Agent runtime, guardrails, memory, knowledge, Dify seed workflows, and the frontend route model calls through its OpenAI-compatible API.
Configuration lives in:
- `deploy/.env`
- `deploy/config/inference-router/config.yaml`
- the `inference-router`, `agent-runtime`, `guardrail`, `memory`, and `knowledge` service sections in `deploy/docker-compose.yml`
## Current Defaults
| Setting | Default | Purpose |
|---|---|---|
| `DEFAULT_MODEL` | `openai/gpt-5.4` | Main agent model for the default no-GPU path. |
| `TITLE_MODEL` | `openai/gpt-5.4-nano` | Short call used for session titles. |
| `LOCAL_MODEL` | `google/gemma-4-E4B-it` | Local model name used by agent-runtime when selecting the local path. |
| `VLLM_MODEL` | `google/gemma-4-E4B-it` | Model loaded by the vLLM container. |
| `CLASSIFIER_MODEL` | `google/gemma-4-E4B-it` | Sensitivity classifier model. |
| `MEMORY_LLM_MODEL` | `openai/gpt-5.4` in `deploy/.env.example`; compose fallback is `meta-llama/llama-3.3-70b-instruct:free` | Memory extraction/consolidation model. |
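
Since these are plain environment variables, overriding a default is a one-line edit in `deploy/.env`. A sketch, reusing model IDs that already appear above; any override must still match a configured route:

```bash
# deploy/.env (excerpt, illustrative)
DEFAULT_MODEL=openai/gpt-5.4
# Example override to the compose fallback model; it must have a matching
# route in deploy/config/inference-router/config.yaml:
# DEFAULT_MODEL=meta-llama/llama-3.3-70b-instruct:free
```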
## Route Configuration
The shipped router config has two backends:
```yaml
backends:
  - name: vllm
    url: http://vllm:8000
    type: vllm
  - name: openrouter
    url: https://openrouter.ai/api/v1
    type: external
    api_key_env: OPENROUTER_API_KEY
    reasoning_capable: true
    reasoning_with_tools: true
```
Routes map public model IDs to a backend:
```yaml
routes:
  - model: google/gemma-4-E4B-it
    backend: vllm
  - model: default
    backend: vllm
  - model: openai/gpt-5.4
    backend: openrouter
```
Use backend name `vllm`, not the older `vllm-1`.
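
As a sketch of routing end to end, a request whose `model` field matches a route is forwarded to that route's backend. This assumes the router exposes the standard OpenAI-compatible chat completions path on the port shown under Verify Routing below:

```bash
# "openai/gpt-5.4" matches the openrouter route above;
# "google/gemma-4-E4B-it" would go to the vllm backend instead.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-5.4",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```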
## Use OpenRouter
- Set `OPENROUTER_API_KEY` in `deploy/.env`.
- Set `DEFAULT_MODEL` to a route configured on the `openrouter` backend.
- Restart services that cache model config:

```bash
docker compose --env-file deploy/.env -f deploy/docker-compose.yml -f deploy/docker-compose.dev.yml restart inference-router agent-runtime knowledge
```
Restart `guardrail` after changing `GUARDRAIL_LLM_MODEL`, and restart `memory` after changing `MEMORY_LLM_MODEL`.
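
For example, using the same compose invocation as above (a sketch):

```bash
# Restart both services in one command after editing their model settings.
docker compose --env-file deploy/.env -f deploy/docker-compose.yml \
  -f deploy/docker-compose.dev.yml restart guardrail memory
```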
The router gates reasoning options on backend capability. Usage metadata is requested generically with `stream_options.include_usage` and `include: ["usage"]` where the upstream accepts it.
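
On the wire, a streaming request that asks for usage might look like this (a sketch, assuming the standard OpenAI-compatible request shape; which fields the router actually forwards depends on the backend's capabilities):

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-5.4",
        "stream": true,
        "stream_options": {"include_usage": true},
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```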
## Use Local vLLM
- Set `VLLM_MODEL`.
- If the model is gated, set `HUGGING_FACE_HUB_TOKEN` (see the sketch after this list).
- Start the GPU profile:

```bash
make up-gpu
```
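
Putting those steps together, the relevant `deploy/.env` lines might look like this (illustrative; the token value is a placeholder and is only needed for gated models):

```bash
# deploy/.env (excerpt, illustrative)
VLLM_MODEL=google/gemma-4-E4B-it
HUGGING_FACE_HUB_TOKEN=hf_your_token_here  # placeholder; gated models only
```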
The vLLM command enables Gemma tool-call and reasoning parsers for the shipped `google/gemma-4-E4B-it` default.
## Verify Routing
```bash
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/routes
```
`/v1/routes` shows the route-to-backend mapping. `/v1/models` shows the model list surfaced to clients.
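
For a scripted check (assuming `/v1/models` returns the standard OpenAI list shape and `jq` is installed):

```bash
# Model IDs surfaced to clients.
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'

# Full route-to-backend mapping.
curl -s http://localhost:8080/v1/routes | jq .
```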
## Dify
Dify has its own provider configuration, but seeded AI-in-a-Box workflows are intended to use the inference router, not direct vLLM/OpenRouter endpoints. In Dify, configure an OpenAI-compatible provider at:
```
http://inference-router:8004/v1
```
Use the same model IDs configured in `deploy/config/inference-router/config.yaml`.
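
To confirm Dify can reach the router over the compose network, you can exec into the Dify API container and hit the models endpoint. A sketch: `dify-api` is a hypothetical service name (substitute the one in `deploy/docker-compose.yml`), and this assumes `curl` is available in the image:

```bash
# Reachability check from inside the Dify container.
docker compose --env-file deploy/.env -f deploy/docker-compose.yml \
  exec dify-api curl -s http://inference-router:8004/v1/models
```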
## Troubleshooting
| Symptom | Check |
|---|---|
| `model not found` | Model ID must appear in `routes`. Restart `inference-router` after editing config. |
| OpenRouter 401 | `OPENROUTER_API_KEY` is empty or invalid. |
| vLLM OOM | Choose a smaller `VLLM_MODEL`, lower `VLLM_GPU_MEM_UTIL`, or use OpenRouter. |
| No usage/cost data | Check `OBSERVABILITY_URL`, `LANGFUSE_*`, and inference-router logs. |
| Dify calls bypass observability | Point Dify at `http://inference-router:8004/v1`, not directly at vLLM or OpenRouter. |
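
When a symptom is unclear, the router's logs are the first place to look (same compose invocation as earlier on this page):

```bash
docker compose --env-file deploy/.env -f deploy/docker-compose.yml \
  -f deploy/docker-compose.dev.yml logs -f inference-router
```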