Skip to content

Security and isolation

Most AI-agent tools ask you to trust the agent. IronClaw does not. It assumes the agent inside the sandbox is already compromised, running attacker code as its own user, and is built so that even then it cannot escape the box, read another session, reach the network, or change its own permissions without a human.

This page is the proof of that claim in one place: the architecture that draws the line, the attacks we run against our own sandbox to test the line, and the measured cost of holding it.

How the box is drawn

The host is the trust root. The agent is not. Everything the agent can touch lives on the left of the wall below. The only ways across are a handful of host-owned choke points, and every change the agent asks for is held for a human.

IronClaw isolation architecture: an untrusted gVisor sandbox on the left, separated by the B1 gVisor wall from the trusted host on the right. The only crossings are host-owned unix sockets and per-session encrypted queues. A legend maps each of six red-team escape attempts to the control that contains it.
The trust boundaries from the threat model (B1 to B5), rendered as one static picture. Open the image for a full-size view. Every component name matches the codebase.

The wall itself is gVisor (runsc), a user-space kernel that gives the agent a full Linux syscall surface while the real host kernel stays behind a seccomp-bounded, capability-dropped boundary. Sandboxes run with network=none, a read-only rootfs, no_new_privs, and a non-root user namespace. The only crossings are two host-owned unix sockets (the model proxy that holds your provider key, and the opt-in egress broker that is deny-by-default) plus the per-session encrypted queues (inbound.db bound read-only, outbound.db append-only).

What a compromised agent still cannot do

We do not just assert the wall holds. We attack it. The red-team escape harness stands up the offline demo control-plane, engages a real per-session sandbox, and then runs each attack below from inside that sandbox, as the sandbox's own user, exactly the privilege a jailbroken agent would have. It asserts each one is contained.

# Attack from inside the sandbox What contains it Boundary
1 Network egress: enumerate NICs, resolve DNS network=none, so only loopback exists and a packet has nowhere to go B1
2 Host escape via the Docker Engine socket The Engine socket is never bound into the sandbox B1
3 Sibling container breakout No docker client and no socket, so the daemon is unreachable B1
4 Read arbitrary host filesystem paths Host root is outside the sandbox mount namespace (read-only rootfs) B1
5 Self-modification: enable a new tool The request is held at the gateway's mandatory human-approval floor B3
6 Cross-session key theft (master and sibling keys) Per-session binds only; the master key and sealed store are never mounted B1, B2

Every row is contained on the production gVisor posture. The self-modification case (row 5) is the one worth dwelling on: an agent can ask to enable a new tool, but it can never apply the change. The request lands on the encrypted outbound queue, the host turns it into a change request, and the gateway holds it in GET /v1/changes/pending until a human approves it. There is no bypass path.

The same harness runs as a continuous CI gate that also carries a negative control: it deliberately weakens the sandbox and asserts the harness catches the regression, so the gate can never go quietly blind.

Honest scope: the laptop demo relaxes two things

The zero-credential demo runs the runc fallback for laptop friendliness, which shares the host kernel and (until the per-session bind fix, IRO-259, lands) binds the whole state directory. The harness prints those as tracked gaps rather than pretending they are closed. The network, Docker-socket, sibling-breakout, and gateway boundaries hold identically on both paths. See the harness README for the full demo-versus-production accounting.

What the wall costs

Isolation is not free, but the cost is bounded and predictable rather than a tax on every operation. Numbers below are the profile you should expect; reproduce them with the harness in Performance and footprint.

Dimension Overhead versus a runc baseline
CPU-bound reasoning Near-native, within a few percent. Work that stays in userspace barely touches the wall.
Memory per sandbox A fixed additive cost, on the order of tens of MiB of RSS beyond the workload.
Sandbox start A one-time additive cost of roughly a couple hundred milliseconds, per launch, not per request.
Syscall or I/O heavy bursts The largest gap, often in the range of roughly 1.5x to 2.5x, because every syscall is mediated. This is the isolation you are buying.
Network throughput Not applicable. Sandboxes run network=none with no NIC, so gVisor's weakest dimension is removed by design.

The takeaway: agent reasoning runs near-native, per-sandbox memory is a roughly fixed cost so host capacity scales linearly with agent count, and the one real overhead sits on syscall-heavy bursts, which is exactly the mediation that makes the box a box.

Verify it yourself

  • Run the red-team harness

    One command, no credentials. It attacks a live sandbox from the inside and prints a PASS or FAIL table.

  • Read the threat model

    The full boundary-by-boundary STRIDE analysis behind the diagram, and what counts as a vulnerability.

  • Reproduce the benchmarks

    The measured overhead, the methodology, and how to run it on your own host.

  • Verify a release

    Every build is checksummed, keyless-signed with cosign, and carries build-provenance attestations.

For the invariants that hold across all of this, start at the Security and trust overview.