Model Configuration

The inference router is the model boundary for AI-in-a-Box. Agent runtime, guardrails, memory, knowledge, Dify seed workflows, and the frontend route model calls through its OpenAI-compatible API.

Configuration lives in:

deploy/.env
deploy/config/inference-router/config.yaml
the inference-router, agent-runtime, guardrail, memory, and knowledge service sections in deploy/docker-compose.yml

Current Defaults

Setting	Default	Purpose
`DEFAULT_MODEL`	`openai/gpt-5.4`	Main agent model for the default no-GPU path.
`TITLE_MODEL`	`openai/gpt-5.4-nano`	Short call used for session titles.
`LOCAL_MODEL`	`google/gemma-4-E4B-it`	Local model name used by agent-runtime when selecting the local path.
`VLLM_MODEL`	`google/gemma-4-E4B-it`	Model loaded by the vLLM container.
`CLASSIFIER_MODEL`	`google/gemma-4-E4B-it`	Sensitivity classifier model.
`MEMORY_LLM_MODEL`	`openai/gpt-5.4` in `deploy/.env.example`; compose fallback is `meta-llama/llama-3.3-70b-instruct:free`	Memory extraction/consolidation model.

Route Configuration

The shipped router config has two backends:

backends:
  - name: vllm
    url: http://vllm:8000
    type: vllm
  - name: openrouter
    url: https://openrouter.ai/api/v1
    type: external
    api_key_env: OPENROUTER_API_KEY
    reasoning_capable: true
    reasoning_with_tools: true

Routes map public model IDs to a backend:

routes:
  - model: google/gemma-4-E4B-it
    backend: vllm
  - model: default
    backend: vllm
  - model: openai/gpt-5.4
    backend: openrouter

Use backend name vllm, not the older vllm-1.

Use OpenRouter

Set OPENROUTER_API_KEY in deploy/.env.
Set DEFAULT_MODEL to a route configured on the openrouter backend.
Restart services that cache model config:

docker compose --env-file deploy/.env -f deploy/docker-compose.yml -f deploy/docker-compose.dev.yml restart inference-router agent-runtime knowledge

Restart guardrail after changing GUARDRAIL_LLM_MODEL, and restart memory after changing MEMORY_LLM_MODEL.

The router gates reasoning options on backend capability. Usage metadata is requested generically with stream_options.include_usage and include: ["usage"] where the upstream accepts it.

Use Local vLLM

Set VLLM_MODEL.
If the model is gated, set HUGGING_FACE_HUB_TOKEN.
Start the GPU profile:

make up-gpu

The vLLM command enables Gemma tool-call and reasoning parsers for the shipped google/gemma-4-E4B-it default.

Verify Routing

curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/routes

/v1/routes shows route-to-backend mapping. /v1/models shows the model list surfaced to clients.

Dify

Dify has its own provider configuration, but seeded AI-in-a-Box workflows are intended to use the inference router, not direct vLLM/OpenRouter endpoints. In Dify, configure an OpenAI-compatible provider at:

http://inference-router:8004/v1

Use the same model IDs configured in deploy/config/inference-router/config.yaml.

Troubleshooting

Symptom	Check
`model not found`	Model ID must appear in `routes`. Restart `inference-router` after editing config.
OpenRouter `401`	`OPENROUTER_API_KEY` is empty or invalid.
vLLM OOM	Choose a smaller `VLLM_MODEL`, lower `VLLM_GPU_MEM_UTIL`, or use OpenRouter.
No usage/cost data	Check `OBSERVABILITY_URL`, `LANGFUSE_*`, and inference-router logs.
Dify calls bypass observability	Point Dify at `http://inference-router:8004/v1`, not directly at vLLM or OpenRouter.

Current Defaults​

Route Configuration​

Use OpenRouter​

Use Local vLLM​

Verify Routing​

Dify​

Troubleshooting​