Per-Session Docker Sandboxes: Isolated Code Execution for AI Agents

· 6 min read

When your AI agent can execute arbitrary code, the execution environment is a security boundary. Shared subprocess execution on the host is fine for single-user prototypes. For a multi-tenant platform where different organizations share the same infrastructure, it is not even close to acceptable.

Current reference: This post explains the design intent behind sandbox isolation. The current API and deployment knobs are documented in the Code Sandbox reference and Run Code tutorial.

We built a per-session Docker sandbox system that gives each user session its own isolated container.

Why subprocess execution fails for multi-tenant platforms

The simplest code execution approach is subprocess.run() on the host. Many AI tools start here. The problems show up quickly:

  • No isolation. One tenant's code can read another tenant's files, environment variables, and process memory.
  • Resource exhaustion. A fork bomb or memory leak in one session takes down the entire host.
  • Privilege escalation. If the host process runs as root (common in Docker-in-Docker setups), executed code inherits those privileges.
  • No cleanup. Files, processes, and network connections persist between sessions.

You could sandbox with something like nsjail or gVisor, but we wanted per-session state persistence (files created in one code execution are available in the next) and the ability to pass through GPUs for ML workloads. Docker containers gave us both.

The container manager

The ContainerManager class manages the lifecycle of sandbox containers. Each tenant+session pair gets its own container with its own Docker volume for workspace persistence:

import docker

class ContainerManager:
    def __init__(self, worker_image, idle_timeout=900, mem_limit="512m",
                 cpu_quota=100000, enable_gpu=False, enable_network=False,
                 max_concurrent_sessions=50):
        self.client = docker.from_env()
        self.sessions: dict[str, SandboxSession] = {}
        self.worker_image = worker_image
        self.idle_timeout = idle_timeout
        self.mem_limit = mem_limit
        self.cpu_quota = cpu_quota
        self.enable_gpu = enable_gpu
        self.enable_network = enable_network
        self.max_concurrent_sessions = max_concurrent_sessions

When a code execution request arrives, get_or_create either returns an existing running container or creates a new one:

def get_or_create(self, tenant_id, session_id) -> SandboxSession:
    key = f"{tenant_id}:{session_id}"
    if key in self.sessions:
        session = self.sessions[key]
        session.container.reload()
        if session.container.status == "running":
            session.touch()
            return session

    # Reject if at capacity
    if len(self.sessions) >= self.max_concurrent_sessions:
        raise RuntimeError("Maximum concurrent sessions reached.")

    # Create volume and container...

The capacity limit prevents resource exhaustion at the platform level. When you hit the limit, new sessions are rejected with a clear error rather than silently degrading the host.
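The elided creation step can be sketched as follows. The helper name `create_session`, the volume-naming scheme, and the `/workspace` bind path are assumptions for illustration; `volumes.create` and `containers.run` are standard docker-py calls, and the client is passed in rather than read from `self` so the sketch stays testable:

```python
def create_session(client, tenant_id: str, session_id: str, run_kwargs: dict):
    """Sketch of the elided creation step.

    `client` is a docker-py DockerClient. `run_kwargs` carries the image,
    command, and hardening options built elsewhere.
    """
    volume_name = f"sandbox-{tenant_id}-{session_id}"
    client.volumes.create(name=volume_name)
    # Bind the per-session volume at /workspace so files written in one
    # code execution are visible to the next within the same session.
    container = client.containers.run(
        detach=True,
        volumes={volume_name: {"bind": "/workspace", "mode": "rw"}},
        working_dir="/workspace",
        **run_kwargs,
    )
    return container, volume_name
```

Creating the volume before the container means a failed container start leaves an orphaned volume, so real cleanup code should remove both.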

Container hardening

Every sandbox container is created with aggressive security constraints:

kwargs = dict(
    image=self.worker_image,
    command="sleep infinity",
    user="1000:1000",              # Non-root
    mem_limit=self.mem_limit,
    memswap_limit=self.mem_limit,  # No swap
    cpu_period=100000,
    cpu_quota=self.cpu_quota,
    pids_limit=256,                # Prevent fork bombs
    network_disabled=True,         # No network access
    security_opt=["no-new-privileges:true"],
    cap_drop=["ALL"],              # Drop ALL Linux capabilities
    cap_add=["CHOWN", "SETUID", "SETGID", "DAC_OVERRIDE"],
    tmpfs={"/tmp": "size=100M,noexec,nosuid,nodev"},
)

Breaking this down:

  • user="1000:1000": The container runs as a non-root user. Even if code exploits a vulnerability in the container runtime, it starts from an unprivileged position.

  • cap_drop=["ALL"]: Every Linux capability is dropped. The container cannot mount filesystems, change network configuration, load kernel modules, or use raw sockets. Only the four capabilities needed for basic file operations are added back.

  • no-new-privileges:true: Prevents privilege escalation via setuid/setgid binaries. Even if a setuid binary exists in the container image, it cannot gain elevated privileges.

  • pids_limit=256: Hard cap on the number of processes. A fork bomb hits this limit and stops rather than consuming all PIDs on the host.

  • network_disabled=True: No network access at all. The sandbox cannot make outbound HTTP requests, connect to databases, or exfiltrate data. All external operations go through the agent runtime's tools (web_search, scrape_url) which are not available inside the sandbox.

  • mem_limit with memswap_limit equal: Memory is capped with no swap. When the limit is hit, the OOM killer terminates the process rather than swapping to disk and slowing down the host.

  • tmpfs with noexec: The /tmp directory is a size-limited tmpfs with no-exec. Code cannot write an executable to tmp and run it.
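These constraints are easy to regress when the container-creation code is edited later. One option, not from the post, is a small checklist over the kwargs dict that a unit test can assert against; a minimal sketch:

```python
def check_hardening(kwargs: dict) -> list[str]:
    """Return the hardening rules a container-creation kwargs dict violates.

    The rule set mirrors the constraints described above; an empty list
    means the dict passes every check.
    """
    problems = []
    if kwargs.get("user") != "1000:1000":
        problems.append("container must run as unprivileged uid:gid 1000:1000")
    if kwargs.get("cap_drop") != ["ALL"]:
        problems.append("all Linux capabilities must be dropped")
    if "no-new-privileges:true" not in kwargs.get("security_opt", []):
        problems.append("no-new-privileges must be set")
    if not kwargs.get("network_disabled"):
        problems.append("network must be disabled by default")
    if kwargs.get("mem_limit") != kwargs.get("memswap_limit"):
        problems.append("memswap_limit must equal mem_limit (no swap)")
    if not kwargs.get("pids_limit"):
        problems.append("pids_limit must be set to contain fork bombs")
    return problems
```

Running this in CI against the dict actually passed to `containers.run` turns a silent security regression into a failing test.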

Path traversal prevention

File read/write operations go through a validation function before touching the container:

@staticmethod
def _validate_path(path: str) -> str:
    if "\x00" in path:
        raise ValueError("Path contains null bytes")
    if path.startswith("/"):
        raise ValueError("Absolute paths are not allowed")
    if ".." in path.split("/") or ".." in path.split("\\"):
        raise ValueError("Path traversal ('..') is not allowed")
    normalized = posixpath.normpath(path)
    if normalized.startswith("..") or normalized.startswith("/"):
        raise ValueError("Path traversal is not allowed after normalization")
    return normalized

This prevents agents from reading /etc/passwd or writing to /usr/bin inside the container. All paths are relative to /workspace.
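A few concrete cases make the behavior clearer. This standalone copy of the validator (renamed `validate_path` only so the snippet is self-contained) accepts clean relative paths and rejects traversal attempts:

```python
import posixpath

def validate_path(path: str) -> str:
    # Same checks as _validate_path above, as a free function.
    if "\x00" in path:
        raise ValueError("Path contains null bytes")
    if path.startswith("/"):
        raise ValueError("Absolute paths are not allowed")
    if ".." in path.split("/") or ".." in path.split("\\"):
        raise ValueError("Path traversal ('..') is not allowed")
    normalized = posixpath.normpath(path)
    if normalized.startswith("..") or normalized.startswith("/"):
        raise ValueError("Path traversal is not allowed after normalization")
    return normalized

# Accepted: relative paths, normalized before use.
assert validate_path("data/./results.csv") == "data/results.csv"

# Rejected: absolute paths and traversal in either separator style.
for bad in ("/etc/passwd", "../secrets", "a/../../b", "a\\..\\b"):
    try:
        validate_path(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```

Checking both `/` and `\\` separators matters because the path string comes from the agent, not from a trusted filesystem API.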

GPU passthrough

For ML workloads (training small models, running inference on data, generating visualizations with GPU acceleration), the container manager supports GPU passthrough:

if self.enable_gpu:
    kwargs["device_requests"] = [
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ]
    kwargs["cap_add"].append("SYS_PTRACE")  # CUDA profiling
if self.enable_network:
    kwargs["network_disabled"] = False

GPU passthrough is separate from network access. GPU containers still keep networking disabled unless an operator explicitly enables networked sandbox sessions, and SYS_PTRACE is added only for CUDA profiling support. These containers are created only when an administrator explicitly enables GPU mode for a tenant.

Lifecycle management

Containers are not permanent. A background task runs cleanup_stale() periodically, removing containers idle for longer than the configured timeout (default 15 minutes):

def cleanup_stale(self):
    now = time.time()
    stale_keys = [
        k for k, s in self.sessions.items()
        if now - s.last_used > self.idle_timeout
    ]
    for key in stale_keys:
        session = self.sessions.pop(key)
        session.container.stop(timeout=5)
        session.container.remove(force=True)
        self.client.volumes.get(session.volume_name).remove(force=True)

Both the container and its volume are destroyed. No state leaks between sessions after cleanup.
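The post does not show how the background task is scheduled; a minimal sketch using a daemon thread (the scheduling approach here is an assumption, not the post's implementation) could look like:

```python
import threading

def start_cleanup_loop(manager, interval: float = 60.0) -> threading.Event:
    """Call manager.cleanup_stale() every `interval` seconds.

    Returns an Event; setting it stops the loop. The thread is a daemon,
    so it will not block process shutdown.
    """
    stop = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep: it returns True
        # immediately once stop is set, ending the loop.
        while not stop.wait(interval):
            manager.cleanup_stale()

    threading.Thread(target=loop, daemon=True, name="sandbox-cleanup").start()
    return stop
```

In an async server the same idea fits naturally as an `asyncio` task; the important property is that cleanup runs off the request path.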

Why not E2B?

E2B is a solid hosted sandbox service. We evaluated it and decided against it for two reasons:

  1. Sovereignty. E2B runs containers on their infrastructure. For an air-gapped or on-premise deployment, that is a non-starter. The sandbox system needs to run on the same infrastructure as everything else.

  2. GPU access. E2B does not currently support GPU passthrough. Our users need to run PyTorch and CUDA workloads inside sandboxes.

The tradeoff is that we maintain our own container management code. For a platform that already runs Docker Compose and Kubernetes, this is a reasonable cost. The ContainerManager class is around 250 lines and handles the full lifecycle.