Let's discuss sandbox isolation
===============================

2026-02-27 · 16 min read

There is a lot of energy right now around sandboxing untrusted code. AI agents generating and executing code, multi-tenant platforms running customer scripts, RL training pipelines evaluating model outputs—basically, you have code you did not write, and you need to run it without letting it compromise the host, other tenants, or itself in unexpected ways.

The word “isolation” gets used loosely. A Docker container is “isolated.” A microVM is “isolated.” A WebAssembly module is “isolated.” But these are fundamentally different things, with different boundaries, different attack surfaces, and different failure modes. I wanted to write down my learnings on what each layer actually provides, because the distinctions matter and let you make informed decisions for the problem you are trying to solve.

When any code runs on Linux, it interacts with the hardware through the kernel via system calls. The Linux kernel exposes roughly 340 syscalls, and its implementation is tens of millions of lines of C code. Every syscall is an entry point into that codebase.

```
Untrusted Code ─( Syscall )─→ Host Kernel ─( Hardware API )─→ Hardware
                              [ 40M LOC C ]
```

Every isolation technique is answering the same question: how do you reduce or eliminate the untrusted code’s access to that massive attack surface?

A useful mental model here is shared state versus dedicated state. Because standard containers share the host kernel, they also share its internal data structures like the TCP/IP stack, the Virtual File System caches, and the memory allocators. A vulnerability in parsing a malformed TCP packet in the kernel affects every container on that host. Stronger isolation models push this complex state up into the sandbox, exposing only simple, low-level interfaces to the host, like raw block I/O or a handful of syscalls.

The approaches differ in where they draw the boundary. Namespaces use the same kernel but restrict visibility.
Seccomp uses the same kernel but restricts the allowed syscall set. Projects like gVisor use a completely separate user-space kernel and make minimal host syscalls. MicroVMs provide a dedicated guest kernel and a hardware-enforced boundary. Finally, WebAssembly provides no kernel access at all, relying instead on explicit capability imports. Each step is a qualitatively different boundary, not just a stronger version of the same thing.

Namespaces as visibility walls
------------------------------

Linux namespaces wrap global system resources so that processes appear to have their own isolated instance. There are eight types, and each isolates a specific resource.

| Namespace | What it isolates | What the process sees |
| --- | --- | --- |
| PID | Process IDs | Own process tree, starts at PID 1 |
| Mount | Filesystem mount points | Own mount table, can have a different root |
| Network | Network interfaces, routing | Own interfaces, IP addresses, ports |
| User | UID/GID mapping | Can be root inside, nobody outside |
| UTS | Hostname | Own hostname |
| IPC | SysV IPC, POSIX message queues | Own shared memory, semaphores |
| Cgroup | Cgroup root directory | Own cgroup hierarchy |
| Time | System clocks (monotonic, boot) | Own system uptime and clock offsets |

Namespaces are what Docker containers use. When you run a container, it gets its own PID namespace (it cannot see host processes), its own mount namespace (its own filesystem view), its own network namespace (its own interfaces), and so on.

The critical thing to understand is that namespaces are visibility walls, not security boundaries. They prevent a process from _seeing_ things outside its namespace. They do not prevent a process from _exploiting the kernel_ that implements the namespace. The process still makes syscalls to the same host kernel. If there is a bug in the kernel’s handling of any syscall, the namespace boundary does not help.
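You can inspect these walls directly: every namespace a process belongs to is exposed as a symlink under /proc/&lt;pid&gt;/ns, and two processes share a namespace exactly when the link targets match. A minimal sketch (assuming Linux with /proc mounted; the `namespace_ids` helper is just for illustration):

```python
import os

def namespace_ids(pid="self"):
    """Map namespace name -> identity, e.g. {'pid': 'pid:[4026531836]', ...}."""
    ns_dir = f"/proc/{pid}/ns"
    # Each entry is a symlink whose target encodes the namespace inode.
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in os.listdir(ns_dir)}

ids = namespace_ids()
print(ids["pid"])  # e.g. "pid:[4026531836]"
```

Inside a Docker container, comparing these IDs against a host process would show different pid, mnt, and net entries: same kernel, different visibility.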
In January 2024, CVE-2024-21626 showed that a file descriptor leak in runc (the standard container runtime) allowed containers to access the host filesystem. The container’s mount namespace was intact — the escape happened through a leaked fd that runc failed to close before handing control to the container. In 2025, three more runc CVEs (CVE-2025-31133, CVE-2025-52565, CVE-2025-52881) demonstrated mount race conditions that allowed writing to protected host paths from inside containers.

Cgroups: accounting is not security
-----------------------------------

Cgroups (control groups) limit and account for resource usage: CPU, memory, disk I/O, number of processes. They prevent a container from consuming all available memory or spinning up thousands of processes.

Cgroups are important for stability, but they are not a security boundary. They prevent denial of service, not escape. A process constrained by cgroups still makes syscalls to the same kernel with the same attack surface.

Seccomp-BPF as a filter
-----------------------

Seccomp-BPF lets you attach a Berkeley Packet Filter program that decides which syscalls a process is allowed to make. You can deny dangerous syscalls like process tracing, filesystem manipulation, kernel module loading, and performance monitoring. Docker applies a default seccomp profile that blocks around 40 to 50 syscalls. This meaningfully reduces the attack surface.

But the key limitation is that seccomp is a filter on the same kernel. The syscalls you allow still enter the host kernel’s code paths. If there is a vulnerability in the write implementation, or in the network stack, or in any allowed syscall path, seccomp does not help.

```
Without seccomp:  Untrusted Code ─( ~340 syscalls )─→ Host Kernel
With seccomp:     Untrusted Code ─( ~300 syscalls )─→ Host Kernel
```

The attack surface is smaller. The boundary is the same.

### Running a container in privileged mode

This is worth calling out because it comes up surprisingly often.
Some isolation approaches require Docker’s privileged flag. For example, building a custom sandbox that uses nested PID namespaces inside a container often leads developers to use privileged mode, because mounting a new /proc filesystem for the nested sandbox requires the CAP_SYS_ADMIN capability (unless you also use user namespaces).

If you enable --privileged just to get CAP_SYS_ADMIN for nested process isolation, you have added one layer (nested process visibility) while removing several others (seccomp, all capability restrictions, device isolation). The net effect is arguably weaker isolation than a standard unprivileged container. This is a real trade-off that shows up in production. The better options are either to grant only the specific capability needed instead of all of them, or to use a different isolation approach entirely that does not require host-level privileges.

gVisor and user-space kernels
-----------------------------

gVisor is where the isolation model changes qualitatively. To understand the difference, it helps to look at the attack surface of a standard container.

```
Standard Container (Docker)

┌───────────────────────┐
│    Untrusted Code     │
└──────────┬────────────┘
           │ ~340 syscalls
    [ Seccomp Filter ]
           │ ~300 allowed syscalls
┌───────────────────────┐
│  Host Kernel (Ring 0) │ ◄── FULL ATTACK SURFACE
└───────────────────────┘
```

The code runs as a standard Linux process. Seccomp acts as a strict allowlist filter, reducing the set of permitted system calls. However, any allowed syscall still executes directly against the shared host kernel. Once a syscall is permitted, the kernel code processing that request is the exact same code used by the host and every other container. The failure mode here is that a vulnerability in an allowed syscall lets the code compromise the host kernel, bypassing the namespace boundaries.
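The "filter on the same kernel" behavior is easy to demonstrate. A minimal sketch (assuming Linux; this uses SECCOMP_MODE_STRICT, the simplest seccomp mode, rather than the BPF filter mode Docker uses): once the filter is installed, the very next disallowed syscall kills the process.

```python
import ctypes, os, signal

# SECCOMP_MODE_STRICT permits only read(), write(), _exit(), and
# sigreturn(); anything else gets the process SIGKILLed by the kernel.
PR_SET_SECCOMP = 22
SECCOMP_MODE_STRICT = 1

libc = ctypes.CDLL(None, use_errno=True)

pid = os.fork()
if pid == 0:
    # Child: enter strict mode, then attempt a now-forbidden syscall.
    libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
    os.getpid()   # getpid(2) is not in the allowed set -> SIGKILL
    os._exit(0)   # never reached
_, status = os.waitpid(pid, 0)
killed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
print("child killed by seccomp:", killed)
```

The kernel, not the process, enforces the filter: the child has no opportunity to handle or mask the kill.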
Instead of filtering syscalls to the host kernel, gVisor interposes a completely separate kernel implementation called the Sentry between the untrusted code and the host. The Sentry does not access the host filesystem directly; instead, a separate process called the Gofer handles file operations on the Sentry’s behalf, communicating over a restricted protocol. This means even the Sentry’s own file access is mediated.

```
gVisor

┌───────────────────────┐
│    Untrusted Code     │
└──────────┬────────────┘
           │ ~340 syscalls
┌───────────────────────┐
│ gVisor Sentry (Ring 3)│ ◄── USER-SPACE KERNEL
└──────┬────────┬───────┘
       │        │ 9P / LISAFS
       │        ▼
       │  ┌───────────┐
       │  │   Gofer   │ ◄── FILE I/O PROXY
       │  └─────┬─────┘
       │        │
       ▼        ▼
┌───────────────────────┐
│  Host Kernel (Ring 0) │ ◄── REDUCED ATTACK SURFACE
└───────────────────────┘
  (~70 host syscalls from Sentry)
```

The Sentry intercepts the untrusted code’s syscalls and handles them in user-space. It reimplements around 200 Linux syscalls in Go, which is enough to run most applications. When the Sentry actually needs to interact with the host to read a file, it makes its own highly restricted set of roughly 70 host syscalls.

This is not just a smaller filter on the same surface; it is a completely different surface. The failure mode changes significantly. An attacker must first find a bug in gVisor’s Go implementation of a syscall to compromise the Sentry process, and then find a way to escape from the Sentry to the host using only those limited host syscalls.

The Sentry intercepts syscalls using one of several mechanisms, such as seccomp traps or KVM, with the default since 2023 being the seccomp-trap approach known as systrap.

What this means in practice is that if someone discovers a bug in the Linux kernel’s I/O implementation, containers using Docker are directly exposed. A gVisor sandbox is not, because those syscalls are handled by the Sentry, and the Sentry does not expose them to the host kernel.

The trade-off is performance.
Every syscall goes through user-space interception, which adds overhead. I/O-heavy workloads feel this the most. For short-lived code execution like scripts and tests, it is usually fine, but for sustained high-throughput I/O, it can matter.

Also, by adopting gVisor, you are betting that it is easier to audit and maintain a smaller footprint of code (the Sentry and its limited host interactions) than to secure the entire massive Linux kernel surface against untrusted execution. That bet is not free of risk (gVisor itself has had security vulnerabilities in the Sentry), but the surface area you need to worry about is drastically smaller and written in a memory-safe language.

Defense in depth on top of gVisor
---------------------------------

gVisor gives you the user-space kernel boundary. What it does not give you automatically is multi-job isolation within a single gVisor sandbox. If you are running multiple untrusted executions inside one runsc container, you still need to layer additional controls. Here is one pattern for doing that:

* Per-job PID + mount + IPC namespaces via clone3 — so each execution is isolated from other executions inside the same gVisor sandbox
* Seccomp-BPF inside the namespace — blocking syscalls like clone3 (preventing nested namespace escape), io_uring (forcing fallback to epoll), ptrace, and kernel module loading
* Privilege drop — run as nobody (UID 65534) with PR_SET_NO_NEW_PRIVS
* Ephemeral tmpfs for all writable paths — cleanup is a single umount2 syscall, not a recursive directory walk
* Read-only root filesystem — the container itself is immutable
* Capability-based file APIs — use openat2 or similar to confine file writes to the work directory, preventing path traversal via ../../etc/passwd
* Network egress control — compute isolation means nothing if the sandbox can freely phone home.
Options range from disabling networking entirely, to running an allowlist proxy (like Squid) that blocks DNS resolution inside the sandbox and forces all traffic through a domain-level allowlist, to dropping CAP_NET_RAW so the sandbox cannot bypass DNS with raw sockets.

```
gVisor Container (runsc)
└─ Per-job PID + Mount Namespace
   └─ Seccomp BPF Filter
      └─ Privilege Drop
         └─ Network Egress Control
            └─ Ephemeral tmpfs
               └─ Capability-confined File Writes
```

Each layer catches different attack classes. A namespace escape inside gVisor reaches the Sentry, not the host kernel. A seccomp bypass hits the Sentry’s syscall implementation, which is itself sandboxed. Privilege escalation is blocked by dropping privileges. Persistent state leakage between jobs is prevented by ephemeral tmpfs with atomic unmount cleanup.

### A note on forking

A practical detail that matters: the process that creates child sandboxes must itself be fork-safe. If you are running an async runtime, forking from a multithreaded process is inherently unsafe because child processes inherit locked mutexes and can corrupt state. The solution is a fork server pattern: fork a single-threaded launcher process before starting the async runtime, then have the async runtime communicate with the launcher over a Unix socket. The launcher creates the children, entirely avoiding the multithreaded fork problem.

```
Startup
  fork() → Launcher (Single-threaded, Poll Loop)
             ├─ clone3(NEWPID | NEWNS | NEWIPC)
             └─ Child (Mount, Privdrop, Seccomp, Execve)

Main Server (Async Runtime)
  └─ AF_UNIX SEQPACKET ─→ Launcher
```

MicroVMs for hardware boundaries
--------------------------------

MicroVMs use hardware virtualization backed by the CPU’s extensions to run each workload in its own virtual machine with its own kernel.
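One item from the defense-in-depth list above is worth making concrete: confining file access with openat2. A minimal sketch (assuming Linux 5.6+, where openat2 is available; the `open_beneath` helper and directory layout are illustrative):

```python
import ctypes, os, tempfile

# openat2(2) with RESOLVE_BENEATH refuses any path resolution that would
# escape the directory referred to by dirfd (errno EXDEV), which defeats
# ../-style traversal by construction. Syscall 437 is arch-independent.
SYS_openat2 = 437
RESOLVE_BENEATH = 0x08

class OpenHow(ctypes.Structure):
    _fields_ = [("flags", ctypes.c_uint64),
                ("mode", ctypes.c_uint64),
                ("resolve", ctypes.c_uint64)]

libc = ctypes.CDLL(None, use_errno=True)

def open_beneath(dirfd, path):
    how = OpenHow(flags=os.O_RDONLY, mode=0, resolve=RESOLVE_BENEATH)
    fd = libc.syscall(SYS_openat2, dirfd, path.encode(),
                      ctypes.byref(how), ctypes.sizeof(how))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd

workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "ok.txt"), "w").close()
dirfd = os.open(workdir, os.O_RDONLY | os.O_DIRECTORY)

fd = open_beneath(dirfd, "ok.txt")           # inside the work dir: allowed
os.close(fd)
try:
    open_beneath(dirfd, "../../etc/passwd")  # traversal attempt
    escaped = True
except OSError as e:
    escaped = False                          # rejected by the kernel (EXDEV)
    print("blocked with:", e.strerror)
```

The appeal over sanitizing paths in user code is that the kernel enforces the boundary during resolution, so symlink tricks and races are covered too.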
```
MicroVM Architecture

┌───────────────────────┐
│    Untrusted Code     │
└──────────┬────────────┘
           │ Syscalls
┌───────────────────────┐
│ Guest Kernel (Ring 0) │ ◄── DEDICATED KERNEL
└──────────┬────────────┘
           │ VirtIO / MMIO
┌───────────────────────┐
│ KVM Hypervisor (Host) │ ◄── HARDWARE BOUNDARY
└──────────┬────────────┘
           │ Secure API
┌───────────────────────┐
│   VMM (User-Space)    │ ◄── DEVICE EMULATION
└───────────────────────┘
```

Code runs in a completely separate, hardware-backed environment with its own guest kernel. It is important to separate the concepts here. The hypervisor (KVM) is the capability built into the Linux kernel that manages the CPU’s hardware virtualization extensions. The virtual machine monitor (VMM) is a user-space process that configures the VM, allocates memory, and emulates minimal hardware devices. The microVM itself is a VM that has been stripped of legacy PC cruft so it boots in milliseconds and uses minimal memory.

Escaping the guest kernel requires finding a vulnerability in the VMM’s device emulation or in the CPU’s virtualization features, and such bugs are rare and highly prized. The guest runs in a separate virtual address space enforced by the CPU hardware. A bug in the guest kernel cannot access host memory because the hardware prevents it. The host kernel only sees a user-space process. The attack surface is the hypervisor and the VMM, both of which are orders of magnitude smaller than the full kernel surface that containers share.

You generally see two different approaches to VMM design depending on the workload. The first is strict minimalism, seen in projects like Firecracker. Built specifically for running thousands of tiny, short-lived functions on a single server, it intentionally leaves out complex features like hot-plugging CPUs or passing through physical GPUs. The goal is simply the smallest possible attack surface and memory footprint.
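For a sense of how little configuration such a minimal VMM needs, Firecracker can boot a microVM from a single JSON file. A hedged sketch (paths are placeholders; check the exact schema against the Firecracker documentation for your version):

```json
{
  "boot-source": {
    "kernel_image_path": "/path/to/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/path/to/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 128
  }
}
```

Booting is then something like `firecracker --no-api --config-file vm.json`: no BIOS, no PCI enumeration, no legacy device emulation.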
The second approach offers broader feature support, seen in projects like Cloud Hypervisor or QEMU’s microvm machine type. Built for heavier and more dynamic workloads, it supports hot-plugging memory and CPUs, which is useful for dynamic build runners that need to scale up during compilation. It also supports GPU passthrough, which is essential for AI workloads, while still maintaining the fast boot times of a microVM.

### Trade-off

The trade-off versus gVisor is that microVMs have higher per-instance overhead but stronger, hardware-enforced isolation. For CI systems and sandbox platforms where you create thousands of short-lived environments, the boot time and memory overhead add up. For long-lived, high-security workloads, the hardware boundary is worth it.

Snapshotting is a feature worth noting. You can capture a running VM’s state including CPU registers, memory, and devices, and restore it later. This enables warm pools: boot a VM once, install dependencies, snapshot it, and restore clones in milliseconds instead of booting fresh each time. This is how some platforms achieve very fast cold starts even with full VM isolation.

WebAssembly with no kernel at all
---------------------------------

WebAssembly takes a fundamentally different approach. Instead of running native code and filtering its kernel access, WASM runs code in a memory-safe virtual machine that has no syscall interface at all. All interaction with the host happens through explicitly imported host functions.

```
WebAssembly (WASM)

┌───────────────────────┐
│    Untrusted Code     │
└──────────┬────────────┘
           │ Function Calls
┌───────────────────────┐
│  WASM Runtime (Host)  │ ◄── MEMORY-SAFE VM
└──────────┬────────────┘
           │ Imported Host Functions
┌───────────────────────┐
│   Allowed Host APIs   │ ◄── EXPLICIT CAPABILITIES
└───────────────────────┘
```

Code runs in a strict sandbox where the only allowed operations are calling functions provided by the host.
If the host doesn’t provide a file-reading function, the WASM module simply cannot read files. The failure mode here requires a vulnerability in the WASM runtime itself, like an out-of-bounds memory read that bypasses the linear memory checks. There is no syscall surface to attack because the code never makes syscalls.

Memory safety is enforced by the runtime. The linear memory is bounds-checked, the call stack is inaccessible, and control flow is type-checked. Modern runtimes add guard pages and memory zeroing between instances.

The performance characteristics are attractive, with very fast cold starts and minimal memory overhead. But the practical limitation is language support. You cannot run arbitrary Python scripts in WASM today without compiling the Python interpreter itself to WASM along with all its C extensions. For sandboxing arbitrary code in arbitrary languages, WASM is not yet viable. For sandboxing code you control the toolchain for, it is excellent.

I am, however, quite curious whether there is a future for WASM in general-purpose sandboxing. Browsers have spent decades solving a similar problem of executing untrusted code safely, and porting those architectural learnings to backend infrastructure feels like a natural evolution.

The spectrum
------------

Putting it all together, the landscape spans from fast, weak isolation to slower, highly secure isolation.

|  | Namespaces | Seccomp | gVisor | MicroVM | WASM |
| --- | --- | --- | --- | --- | --- |
| Boundary | Visibility walls only | Syscall filter on same kernel | Separate kernel in user-space | Hardware boundary via KVM | No kernel access at all |
| Speed | Fast | Fast | Moderate | Slower | Fastest |
| Isolation strength | Weakest | Weak | Strong | Strongest | Strong (limited scope) |

For running trusted code that you wrote and reviewed, Docker with a seccomp profile is probably fine. The isolation is against accidental interference, not adversarial escape.
For running untrusted code in a multi-tenant environment, like short-lived scripts, AI-generated code, or customer-provided functions, you need a real boundary. gVisor gives you a user-space kernel boundary with good compatibility, while a microVM gives you a hardware boundary with the strongest guarantees. Either is defensible depending on your threat model.
