Building a Self-Hosted AI Platform from Scratch
Most AI platforms assume you are comfortable sending your data to someone else's servers. For a growing number of organizations, that assumption is wrong. Regulated industries, defense contractors, research labs, and privacy-conscious companies need AI capabilities where no data leaves their network. That is the problem AI-in-a-Box was built to solve.
Current architecture: This post describes the product thesis and early architecture. The current service map, auth model, subagent model, and receipt system are documented in the Architecture reference, Authentication reference, and Audit Trail reference.
The sovereign AI thesis
"Sovereign AI" is not a marketing term. It describes a concrete constraint: inference, data storage, and agent orchestration can run on infrastructure you own and control. A strict deployment can keep model calls local; the default developer path can also route through external providers such as OpenRouter when the operator configures that route.
This matters for three reasons:
- Regulatory compliance. HIPAA, PCI-DSS, ITAR, and similar frameworks impose strict controls on where data can be processed. Sending patient records or classified documents to an external LLM endpoint is a non-starter.
- Latency and availability. If your AI workflows depend on an external API, you inherit that API's outage schedule. A self-hosted stack on local GPUs gives you low-latency inference with no dependency on someone else's uptime.
- Cost at scale. API pricing works fine for prototyping. At thousands of requests per hour, running your own vLLM instance on commodity hardware is dramatically cheaper.
The design principle is "local-first, not local-only." The platform can optionally route requests to external providers like OpenRouter, but only when an administrator explicitly enables it and only through a policy-gated external gateway that logs every outbound call.
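Sketched in Python, the gating logic amounts to something like the following; RoutePolicy and audit_log are illustrative names, not the platform's actual API:

```python
# Minimal sketch of the local-first routing gate, assuming hypothetical
# RoutePolicy and audit_log names; this is not the platform's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutePolicy:
    external_enabled: bool          # flipped on only by an administrator
    allowed_models: frozenset[str]  # models cleared for external routing

def select_backend(model_id: str, policy: RoutePolicy, audit_log) -> str:
    """Default to local vLLM; go external only when policy explicitly allows."""
    if policy.external_enabled and model_id in policy.allowed_models:
        audit_log(event="external_route", model=model_id)  # every outbound call logged
        return "openrouter"
    return "vllm"
```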
Architecture overview
AI-in-a-Box is a microservices platform. Each subsystem is an independent service with its own API, communicating via REST and async events. Here is the high-level layout:
User -> Frontend (React/Vite) -> API Gateway (Go)
  -> Agent Runtime (Python)
     -> Guardrail Service (input check)
     -> Memory Service (context retrieval)
     -> Inference Router -> vLLM / External Gateway (OpenRouter)
     -> Guardrail Service (output check)
     -> Memory Service (store new memories)
  -> API Gateway -> Frontend (streamed response via SSE)
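In code, the agent runtime's request pipeline roughly follows this shape; the client objects and method names below are assumptions for illustration, not the runtime's real API:

```python
# Illustrative shape of the request pipeline from the diagram above.
async def handle_chat(request, guardrails, memory, inference):
    checked = await guardrails.check_input(request.text)      # input guardrail
    context = await memory.retrieve(request.tenant, checked)  # context retrieval
    reply = await inference.complete(                         # vLLM or external gateway
        model=request.model, prompt=checked, context=context
    )
    safe = await guardrails.check_output(reply)               # output guardrail
    await memory.store(request.tenant, request.text, safe)    # store new memories
    return safe                                               # streamed back over SSE
```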
The key services:
- API Gateway (Go): OIDC authentication, rate limiting, SSE streaming, audit logging. Every request gets a correlation ID and tenant extraction from JWT claims.
- Agent Runtime (Python, OpenAI Agents SDK): The brain. Handles chat, typed tool dispatch, SKILL.md loading, and Delegate-based subagent dispatch.
- Inference Router (Go): A lightweight proxy between the agent runtime and inference backends. Maintains a model registry mapping public model IDs such as google/gemma-4-E4B-it or openai/gpt-5.4 to backends such as vllm and openrouter, health-checks backends, and exposes a unified OpenAI-compatible API.
- Guardrail Service (Python): Constitutional AI layer with pluggable backends. Input guardrails handle jailbreak detection, PII redaction, and topic control. Output guardrails enforce content policy and flag hallucinations. Policies are scoped hierarchically: global, then tenant, then per-agent. Lower levels can only restrict, never relax (see the policy-merge sketch after this list).
- Memory Service (Python): Typed memory powered by Mem0 and Qdrant. Four memory types (user, feedback, project, reference) are stored as vector embeddings in per-tenant Qdrant collections. The message buffer lives in Redis as a sliding window. Memories are automatically extracted from conversations and retrievable via semantic search.
- Code Sandbox: Per-session Docker containers with hardened security (all capabilities dropped, PID limits, no network, non-root user). Optional GPU passthrough for ML workloads. A sketch of such a launch follows below.
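To make the restrict-only rule concrete, here is a minimal Python sketch of merging policies down the global -> tenant -> agent hierarchy; the Policy shape and its fields are illustrative assumptions, not the service's actual schema:

```python
# Restrict-only merge across the global -> tenant -> agent hierarchy;
# the Policy shape is an assumption for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    blocked_topics: frozenset[str]
    max_output_tokens: int

def merge(parent: Policy, child: Policy) -> Policy:
    """A lower scope can only tighten its parent's policy, never relax it."""
    return Policy(
        blocked_topics=parent.blocked_topics | child.blocked_topics,  # union only adds blocks
        max_output_tokens=min(parent.max_output_tokens, child.max_output_tokens),
    )

global_p = Policy(frozenset({"weapons"}), 4096)
tenant_p = Policy(frozenset({"medical_advice"}), 2048)
agent_p = Policy(frozenset(), 1024)
effective = merge(merge(global_p, tenant_p), agent_p)
# effective blocks both topics and caps output at 1024 tokens
```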
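And here is roughly what the hardened container launch looks like via the Docker Python SDK; the image name and the specific limits are assumptions, not the platform's defaults:

```python
# Hardened per-session sandbox launch with docker-py; image name and
# limits are illustrative assumptions.
import docker

client = docker.from_env()
container = client.containers.run(
    "sandbox-python:latest",             # hypothetical sandbox image
    command=["python", "/work/run.py"],
    cap_drop=["ALL"],                    # drop every Linux capability
    pids_limit=128,                      # cap process count inside the container
    network_disabled=True,               # no network access
    user="1000:1000",                    # non-root user
    security_opt=["no-new-privileges"],
    mem_limit="512m",
    detach=True,
)
# Optional GPU passthrough for ML workloads:
# device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])]
```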
What makes it different
There are other self-hosted AI tools. Dify provides visual workflow building. Open WebUI gives you a chat interface. LocalAI wraps inference. AI-in-a-Box is not trying to replace any one of these. It integrates with Dify for visual workflows and runs its own inference via vLLM. The difference is in what it adds on top:
Multi-tenant isolation from day one. Every database table has a tenant_id column. Vector collections are per-tenant. Memory scopes are per-tenant. The gateway extracts tenant identity from OIDC tokens. This is not an afterthought bolted onto a single-user system.
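For illustration, here is a Python sketch of that tenant-extraction step using PyJWT; the real gateway is Go, and the tenant_id claim and audience value are assumptions:

```python
# Python sketch of tenant extraction from an OIDC access token; the gateway
# itself is Go, and the "tenant_id" claim and audience are assumptions.
import jwt  # PyJWT

def tenant_from_token(token: str, public_key: str) -> str:
    claims = jwt.decode(token, public_key, algorithms=["RS256"],
                        audience="ai-in-a-box")  # hypothetical audience
    tenant = claims.get("tenant_id")
    if tenant is None:
        raise PermissionError("token carries no tenant claim")
    return tenant
```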
Enterprise compliance built in. The audit service produces hash-chained, HMAC-signed immutable logs suitable for SOC2, HIPAA, and PCI-DSS audits. Every agent action, every guardrail decision, every external API call is logged with a tamper-evident chain.
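A minimal sketch of how a hash-chained, HMAC-signed log achieves tamper evidence; the field names and encoding here are assumptions, not the audit service's actual record format:

```python
# Each record signs its event plus the previous record's hash, so editing
# any entry breaks verification for every entry after it.
import hashlib
import hmac
import json

def append_record(log: list[dict], event: dict, key: bytes) -> None:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    digest = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    log.append({"event": event, "prev": prev_hash, "hash": digest})

def verify(log: list[dict], key: bytes) -> bool:
    prev = "0" * 64
    for rec in log:
        body = json.dumps({"event": rec["event"], "prev": prev}, sort_keys=True)
        expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
        if rec["prev"] != prev or rec["hash"] != expected:
            return False  # chain broken: record altered, dropped, or reordered
        prev = rec["hash"]
    return True
```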
Agent orchestration, not just chat. The agent runtime supports adaptive single-agent conversations, Delegate-based typed subagents for focused independent work, and visual workflows via Dify. Agents have typed memories and load skills on demand using the SKILL.md format.
Inference flexibility. The inference router supports vLLM for local GPU serving and policy-gated external providers via OpenRouter. Models are pre-staged from MinIO or local storage, supporting fully air-gapped deployments with no need to download from HuggingFace at runtime.
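A sketch of the kind of registry entry the router might hold; the entries, field names, and minio:// source convention are assumptions for illustration:

```python
# Illustrative model registry mapping public model IDs to backends.
MODEL_REGISTRY = {
    "google/gemma-4-E4B-it": {"backend": "vllm",
                              "source": "minio://models/gemma-4-E4B-it"},  # pre-staged weights
    "openai/gpt-5.4": {"backend": "openrouter"},  # policy-gated external route
}

def resolve(model_id: str) -> dict:
    entry = MODEL_REGISTRY.get(model_id)
    if entry is None:
        raise ValueError(f"unknown model: {model_id}")
    return entry
```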
Deployment model
The platform deploys in two ways:
Docker Compose for single-node or small team setups. One docker-compose.yml brings up all services plus infrastructure (PostgreSQL, Redis, Qdrant, MinIO, Keycloak, Langfuse). Works on a single machine with one or more GPUs.
Helm Charts for Kubernetes. Same container images, Kubernetes-native: Deployments with HPAs, GPU scheduling via NVIDIA device plugin, persistent volumes, Ingress, ConfigMaps, and optional service mesh for mTLS.
The smallest useful deployment is a single machine with an RTX 3070 or better, running a 2B-7B parameter model via vLLM. The largest is a multi-node Kubernetes cluster with dedicated vLLM instances per tenant.
What comes next
The platform is functional today for chat, multi-agent workflows, RAG, code execution, web search, and memory. The next priorities are:
- Model management UI for uploading and registering models
- Skill marketplace for sharing SKILL.md packages across tenants
- Voice and multimodal input support
- Automatic memory summarization and compression for long-running agents
The code is open source under the MIT license. If you are building sovereign AI infrastructure, we would like to hear what you need.