Threat model

Versioned with the code. Linked from the README and SECURITY.md. The scope in §8 ("what counts as a vulnerability") is the reference the disclosure policy points back to.

IronClaw assumes the agent inside the sandbox is potentially compromised — by prompt injection, a poisoned tool result, or a hostile model output — and designs the trust boundary so that a compromised agent cannot escalate. The host is trusted; the agent is not. Everything below follows from that single asymmetry.

1. Scope and core assumption

Trusted: the host OS and root, the control-plane binary, the maintainer, the gVisor runtime, and the integrity of the build/release pipeline.
Untrusted: the sandboxed agent and everything it emits (tool calls, outbound messages, capability-change requests), every inbound chat message, and every byte returned by an approved external API.
Goal: a fully-compromised agent still cannot read another session's data, change its own configuration, reach the host filesystem or arbitrary network, obtain a host-held secret, or bypass the human-approval gateway.

What is explicitly not in scope (a compromised host, a malicious maintainer, physical attacks, model "alignment") is enumerated in §8–§9, so that the scope of disclosure stays well defined.

2. Assets

Asset	Why it matters
User conversation data	At rest in the per-session queues; the primary confidentiality target.
Cross-session isolation	One session (or agent group) must never read another's queue or workspace.
Host secrets	The Anthropic API key, the keystore master key, per-session 256-bit queue keys, the API bearer token — none may reach the sandbox.
Gateway integrity	The guarantee that no persona/tools/packages/wiring/permissions/mounts/create_agent change applies without a human decision.
The host machine	Filesystem, processes, and network reachability of the box the control-plane runs on.
Model & egress budget	Cost and rate of model calls and approved-egress calls.

3. Trust boundaries and data flow

flowchart TB
  U["User / external sender"]
  U -- "Tailscale mesh only (B5)" --> API
  subgraph host["Trusted host"]
    subgraph cp["Control-plane process"]
      API["API + bearer auth"]
      GW["Gateway: deterministic verifier chain + human approval"]
      RT["Router / delivery / sweep"]
      MP["Model proxy (holds Anthropic key)"]
      EB["Egress broker (opt-in)"]
    end
    INQ["inbound.db — encrypted, RO to sandbox"]
    OUTQ["outbound.db — encrypted, RW"]
  end
  API --> GW --> RT
  RT --> INQ
  OUTQ --> RT
  subgraph box["Sandbox — gVisor, network=none (B1)"]
    LOOP["Agent loop + tools"]
  end
  INQ -- "RO bind mount (B2)" --> LOOP
  LOOP -- "append (B2)" --> OUTQ
  LOOP -- "unix socket" --> MP -- "HTTPS, key injected host-side" --> ANTH["Anthropic API"]
  LOOP -- "unix socket (opt-in)" --> EB -- "HTTPS, deny-by-default allowlist (B4)" --> EXT["Approved external APIs"]
  LOOP -. "a2a via queues, gateway-gated (B3)" .-> RT

The boundaries this document analyzes with STRIDE:

B1 — host ↔ sandbox: the gVisor wall. The only crossings are two host unix sockets (model proxy, and the opt-in egress broker) and the bind-mounted queue files. No NIC exists inside the sandbox.
B2 — control-plane ↔ agent: the per-session encrypted queue pair. Inbound is read-only to the agent; outbound is append-only.
B3 — agent ↔ agent (a2a): one agent group messaging another, and the approval-gated create_agent change (RFC-0004).
B4 — egress broker ↔ network: the bounded relaxation of network=none for operator-approved external APIs (see §7).
B5 — admin ↔ control-plane (supporting): the mesh-only API plus bearer token.

4. Adversaries

#	Adversary	Capability assumed
A1	Compromised sandbox agent (primary)	Full control of the agent loop: arbitrary tool calls, outbound messages, and change requests.
A2	Malicious external sender	Can send any inbound chat content (the prompt-injection vector that produces A1).
A3	Tailnet-adjacent attacker	Has a foothold on the private mesh and can reach the API port.
A4	Offline / disk attacker	Can read the state directory and queue files at rest.
A5	Malicious approved-egress host	A host the operator added to the egress allowlist that then behaves adversarially.

A compromised host or root (A0) is not modeled — it is the trust anchor (§1).

5. STRIDE by boundary

Each table is a STRIDE pass (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege) for one boundary, with the control that addresses it.

B1 — host ↔ sandbox (the gVisor wall)

STRIDE	Threat	Mitigation
S	Agent impersonates the host/model to itself	Host services are reached only over host-owned unix sockets bound into the sandbox; the agent cannot forge the other end.
T	Agent edits its own runtime/source to change behavior	Compiled Go binary, no interpreter, read-only rootfs; there is no source in the box to rewrite.
R	Agent denies an action it took	Every change request and model/egress call is audited host-side, outside the sandbox's reach.
I	Agent reads a host secret (Anthropic key, master key)	Secrets stay host-side; the model proxy injects the API key into the outbound call — the sandbox never sees it. Session keys arrive via tmpfs at launch, never an env var or the image.
D	Agent exhausts host resources	gVisor resource limits; host respawn crash-loop backoff; sweep reclaims stale sandboxes.
E	Sandbox escape to the host	gVisor (runsc): dropped capabilities, `no_new_privs`, non-root user namespace, read-only rootfs.

B2 — control-plane ↔ agent (the encrypted queues)

STRIDE	Threat	Mitigation
S	Agent writes a message attributed to another sender/session	The host owns routing and stamps identities; the agent only appends to its own outbound queue.
T	Agent writes the inbound queue it should only read	Three independent layers: Go interface segregation (no write method), `PRAGMA query_only`, and a read-only OS bind mount.
R	Disputed message history	Append-only queues with monotonic sequence; host-side audit.
I	Cross-session / at-rest queue reads (A4)	Per-session encrypted SQLite (raw 256-bit key, pinned cipher params); no plaintext on disk; wrong key fails closed.
D	Agent floods the outbound queue	Sweep + delivery backpressure; per-session isolation bounds blast radius to one session.
E	Agent uses the queue to drive a privileged host action	Queues carry data, not commands; privileged needs must take the gateway path (B3 / change requests), which the queue cannot shortcut.
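The first of the three layers in the T row (interface segregation) can be sketched as follows. This is a minimal illustration, not IronClaw's code: `Message`, `memQueue`, `readOnly`, and `AgentInbound` are hypothetical names, and an in-memory slice stands in for the encrypted SQLite queue. The wrapper type matters because handing out the concrete queue behind an interface would still let a caller recover the write method with a type assertion.

```go
package main

import "errors"

// Message is a hypothetical queue record; the real schema is not shown here.
type Message struct {
	Seq  uint64
	Body string
}

// InboundReader is the only view of the inbound queue given to agent code.
// It has no write method, so a write cannot even be expressed.
type InboundReader interface {
	Next(after uint64) (Message, bool)
}

// OutboundAppender is the agent's view of the outbound queue: append only.
type OutboundAppender interface {
	Append(body string) (uint64, error)
}

// memQueue is an in-memory stand-in for one encrypted queue file.
type memQueue struct{ msgs []Message }

// Next returns the first message whose sequence number is greater than after.
func (q *memQueue) Next(after uint64) (Message, bool) {
	for _, m := range q.msgs {
		if m.Seq > after {
			return m, true
		}
	}
	return Message{}, false
}

// Append stores a message under the next monotonic sequence number.
func (q *memQueue) Append(body string) (uint64, error) {
	if body == "" {
		return 0, errors.New("empty message")
	}
	seq := uint64(len(q.msgs)) + 1
	q.msgs = append(q.msgs, Message{Seq: seq, Body: body})
	return seq, nil
}

// readOnly wraps the queue so the concrete type (and its Append method)
// cannot be recovered from the interface value by a type assertion.
type readOnly struct{ q *memQueue }

func (r readOnly) Next(after uint64) (Message, bool) { return r.q.Next(after) }

// AgentInbound narrows a queue to the read-only view before it reaches agent code.
func AgentInbound(q *memQueue) InboundReader { return readOnly{q: q} }
```

In this sketch the host keeps the `*memQueue` and the agent loop only ever receives the value returned by `AgentInbound`. The other two layers (`PRAGMA query_only` and the read-only bind mount) exist because a language-level restriction alone would not survive a compromised process.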

B3 — agent ↔ agent (a2a + create_agent)

STRIDE	Threat	Mitigation
S	Agent spoofs another agent group's identity	Group identities are host-assigned in the registry; a2a routing is mediated by the control-plane, not peer-to-peer.
T	Agent tampers with another group's config via a message	a2a is data only; config changes for any group still require a gateway-approved ChangeRequest.
R	Untraceable agent-to-agent traffic	a2a flows through host routing/delivery and is audited like any other message.
I	Agent reads another group's data by messaging it	Delivery respects destination allowlists and access checks; no shared queue is exposed.
D	Agent fans out messages to amplify load	Routing/delivery is host-controlled; sweep and per-session limits bound it.
E	Agent provisions a new, over-privileged agent (`create_agent`)	`create_agent` is privileged and always routed through the gateway's mandatory human-approval floor — a new agent is a new trust principal and is never auto-approved (RFC-0004).
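The gateway behaviour the E row relies on (deterministic verifiers first, then a mandatory human decision) might look like the sketch below. All names here (`ChangeRequest` fields, `Verifier`, `Outcome`, `knownKind`, `decide`) are assumptions made for illustration; the change kinds are the ones listed under "Gateway integrity" in §2. The property to notice is structural: no code path returns `Approved` without an explicit human "yes".

```go
package main

import "errors"

// ChangeRequest is a hypothetical, trimmed-down capability change.
type ChangeRequest struct {
	Kind  string // e.g. "create_agent", "tools", "mounts"
	Group string // the agent group the change targets
}

// Verifier is one deterministic check; any error rejects the request outright.
type Verifier func(ChangeRequest) error

// Outcome is the gateway's decision for one request.
type Outcome string

const (
	Rejected     Outcome = "rejected"
	PendingHuman Outcome = "pending-human"
	Approved     Outcome = "approved"
)

// gatedKinds mirrors the change kinds named in §2.
var gatedKinds = map[string]bool{
	"persona": true, "tools": true, "packages": true, "wiring": true,
	"permissions": true, "mounts": true, "create_agent": true,
}

// knownKind is an example verifier: it refuses kinds the gateway does not know.
func knownKind(cr ChangeRequest) error {
	if !gatedKinds[cr.Kind] {
		return errors.New("unknown change kind: " + cr.Kind)
	}
	return nil
}

// decide runs the verifier chain, then applies the human floor.
// human is nil while no decision has been recorded yet.
func decide(cr ChangeRequest, chain []Verifier, human *bool) Outcome {
	for _, verify := range chain {
		if err := verify(cr); err != nil {
			return Rejected // a verifier failure cannot be overridden by a human
		}
	}
	switch {
	case human == nil:
		return PendingHuman
	case *human:
		return Approved
	default:
		return Rejected
	}
}
```

A `create_agent` request that passes every verifier therefore waits in `PendingHuman` until a person decides, which is the "never auto-approved" posture described above.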

B4 — egress broker ↔ network

STRIDE	Threat	Mitigation
S	Agent reaches an unapproved host by spoofing a target	Forwarding is matched against an explicit host allowlist; an unapproved host returns 403.
T	Agent rewrites the egress destination/allowlist	The sandbox cannot mutate the allowlist; additions/removals clear the gateway and are logged.
R	Deniable exfiltration	Every request — allowed or denied — emits an audit record (host, path, method, status, byte counts, duration).
I	Exfiltration to an attacker-controlled endpoint	Deny-by-default allowlist; empty allowlist forwards nothing; HTTPS only. Residual risk to approved hosts is accepted and bounded (see §7).
D	Agent saturates an approved host / the link	Per-request audit makes abuse visible; rate caps are tracked future work (§9).
E	Broker is used to launder access to a host-held credential	The broker injects no host secrets and forwards only the request's own headers — it cannot grant access the sandbox does not already hold. Not a credential vault (an explicit non-goal).

B5 — admin ↔ control-plane (supporting)

Spoofing/EoP from the network: the API binds only to the Tailscale interface (no public port). When the optional bearer token is configured (it is compared in constant time), a foothold on the tailnet alone (A3) is not enough to drive the gateway. The /healthz and /readyz probes are the only unauthenticated endpoints.
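A constant-time bearer check of the kind described here could be written as below. This is a sketch under assumptions: `bearerOK` and `needsAuth` are hypothetical helpers, and the sketch refuses every request when no expected token is set, which may differ from how the real daemon treats the optional-token case. Hashing both sides first gives `subtle.ConstantTimeCompare` equal-length inputs, so the token length is not revealed either.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"strings"
)

// bearerOK reports whether an Authorization header carries the expected token.
// Both values are hashed so the comparison always runs over 32-byte inputs.
func bearerOK(header, want string) bool {
	const prefix = "Bearer "
	if want == "" || !strings.HasPrefix(header, prefix) {
		return false // assumption: fail closed when no token is configured
	}
	got := sha256.Sum256([]byte(strings.TrimPrefix(header, prefix)))
	exp := sha256.Sum256([]byte(want))
	return subtle.ConstantTimeCompare(got[:], exp[:]) == 1
}

// needsAuth encodes the rule that the two probes are the only open endpoints.
func needsAuth(path string) bool {
	return path != "/healthz" && path != "/readyz"
}
```

A handler would call `needsAuth(r.URL.Path)` first and, if it returns true, require `bearerOK(r.Header.Get("Authorization"), token)` before routing the request any further.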

6. Privilege matrix

What each principal can do. gateway = only via a human-approved ChangeRequest.

Capability	Sandbox agent	Control-plane	Admin (human)	External sender	Approved egress host
Read its own session queue	✓	✓	—	—	—
Read another session's queue	✗	✓	—	✗	✗
Write the outbound queue	append-only	✓	—	✗	✗
Write the inbound queue	✗	✓	—	via routing	✗
Change config (persona/tools/wiring/…)	request only → `gateway`	apply after approval	approve/reject	✗	✗
Provision a new agent (`create_agent`)	request only → `gateway`	apply after approval	approve/reject	✗	✗
Reach the model	via host proxy	✓ (holds key)	—	—	—
Reach an external network host	only approved hosts, via broker	✓	configures allowlist	✗	n/a
Reach the host filesystem	✗ (RO rootfs + bind mounts)	✓	✓	✗	✗
Hold a host secret	✗	✓	✓	✗	✗

7. Egress broker

The egress broker (internal/host/egress) lets an agent reach operator-approved external APIs beyond the model host. It is a deliberate, bounded relaxation of the original "model host only" posture, designed to keep the trust boundary intact:

The sandbox stays network=none. It never gets a NIC. The broker is a second host unix socket bound into the sandbox (alongside the model-proxy socket); every byte still crosses a host choke point. The egress socket is bound only when SandboxSpec.EgressSocket is set, so the default (HardenedSpec) remains sealed to the model proxy alone.
Deny by default. The broker forwards only to hosts on an explicit allowlist. An empty allowlist forwards nothing; an unapproved host returns 403. In production the allowlist is mutated only after the change clears the gateway's human approval (wired by the daemon).
Audited. Every request — allowed or denied — emits an audit record (host, path, method, status, byte counts, duration). Allowlist additions/removals are logged.
HTTPS only, and not a credential vault (an explicit IronClaw non-goal): the broker injects no host-held secrets and forwards the request's own headers, so it cannot launder access to a credential the sandbox does not already hold.
Agent surface: the opt-in http_fetch sandbox tool, registered only when an egress socket is configured.
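The deny-by-default and HTTPS-only rules above can be illustrated with a small allowlist type. This is not the code in internal/host/egress; `Allowlist`, `Approve`, and `Check` are hypothetical names, and the 400 answer for an unparseable target is an assumption of the sketch. The zero value has no hosts, so it forwards nothing.

```go
package main

import (
	"net/http"
	"net/url"
	"strings"
	"sync"
)

// Allowlist is a hypothetical deny-by-default host set.
// Its zero value is usable and approves no host at all.
type Allowlist struct {
	mu    sync.RWMutex
	hosts map[string]bool
}

// Approve adds a host. In the design above this would only run after the
// change has cleared the gateway's human approval.
func (a *Allowlist) Approve(host string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.hosts == nil {
		a.hosts = make(map[string]bool)
	}
	a.hosts[strings.ToLower(host)] = true
}

// Check returns the status the broker would answer with before forwarding:
// 200 to forward, 403 to refuse, 400 if the target cannot be parsed.
func (a *Allowlist) Check(rawURL string) int {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return http.StatusBadRequest
	}
	if u.Scheme != "https" {
		return http.StatusForbidden // HTTPS only
	}
	a.mu.RLock()
	defer a.mu.RUnlock()
	if !a.hosts[strings.ToLower(u.Hostname())] {
		return http.StatusForbidden // exact match only; no suffix or wildcard match
	}
	return http.StatusOK
}
```

Matching on the exact hostname rather than a suffix is what keeps a lookalike such as `api.example.com.evil.test` from passing as `api.example.com`.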

Threat-model review sign-off: reviewed and approved by the maintainer (sole CODEOWNER, @omerzamir) on 2026-06-16, via the decision to un-gate the egress broker and wire egress live. Residual risk accepted: a compromised agent can exfiltrate to / pull from any approved host; this is mitigated by deny-by-default approval and full per-request audit, and bounded by keeping the sandbox itself network-less.

8. What counts as a vulnerability

For disclosure (see SECURITY.md), a finding is a vulnerability if it lets the untrusted agent or an external sender cross a boundary above — concretely, if it allows any of:

escaping the sandbox to the host (B1);
reading another session's queue/workspace, or any at-rest plaintext (B2);
applying a configuration or create_agent change without human approval, or otherwise bypassing the gateway (B2/B3);
reaching the host filesystem, or a network host that is not on the approved egress allowlist (B1/B4);
causing a host-held secret (Anthropic key, master/session keys, API token) to become readable inside the sandbox (B1);
tampering with or forging audit/queue records (R across boundaries).

A finding is not a vulnerability (so please don't file it as one) if it is:

the model saying something wrong, biased, or "jailbroken" — that is agent behavior, not a boundary breach (the threat model already assumes a hostile agent);
data sent to a host the operator explicitly approved on the egress allowlist — that is the feature working as designed (§7);
anything that requires a compromised host, root, or the maintainer (A0), or physical access — outside the trust model (§1);
a denial-of-service that requires host privileges to trigger;
the agent failing to complete a task.

9. Non-goals and documented future work

Intentional non-goals of the sealed / network=none design (do not file these as gaps): in-sandbox web/browser access, package installation / self-modification, and a credential vault built into the egress broker / trusted core (the broker is never a secret sink — B4-E). Request-time credential injection via a separate host-side principal behind the broker — the vault-behind-broker integration — is now supported under host governance (see §11); building injection into the broker stays the non-goal. (Multiple model-provider backends were previously listed here too; they are now supported under host governance — see §10.)

Documented future hardening: per-host egress rate caps; a Kata isolation backend behind the same Isolator interface; automated (non-human) gateway approval for low-risk change kinds, with the mandatory-human floor staying the default. (Response-secret redaction for the egress broker — the model-proxy redaction pattern applied to egress — is now implemented for the vault path; see §11.)

10. Multi-provider model egress

The model proxy (internal/host/modelproxy) supports per-agent-group model provider selection beyond the original Anthropic-only posture: a group may run on Anthropic (the default), OpenAI, or OpenRouter. This was a §9 non-goal; it is now supported as a deliberate, host-governed relaxation that keeps the trust boundary intact — it reuses the exact same choke point and controls as the single-provider proxy:

The sandbox stays network=none. Every provider is reached through the same host model-proxy unix socket; no new socket and no NIC are added. The sandbox-side provider abstraction (internal/sandbox/provider) only ever dials that socket, addressing the real upstream host so the proxy's allowlist matches and routes it.
The host is still the sole authenticator. Each provider's credential lives only on the host (ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY); the proxy strips any sandbox-supplied auth and injects the credential matching the upstream host (MultiInjector). The sandbox holds no provider key — the §1 "host secret" guarantee is unchanged for every provider.
Deny by default, and only what the operator enabled. A provider is reachable only when its credential is present in the control-plane environment: the proxy allowlists exactly the enabled providers' hosts, so an un-configured provider returns 403 like any other unapproved host. Provider selection for a group is registry config (AgentGroup.Provider/Model), consumed at sandbox launch; a change to it is a gateway-gated configuration change like any other.
Audited uniformly. The per-request audit record (host, path, method, status, byte counts, duration) and the rate cap apply to every provider, not just Anthropic — multi-provider egress is as observable as the original path. Each provider's key is also registered for response-secret redaction.
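The "strip, then inject by upstream host" behaviour attributed to MultiInjector can be sketched as follows. This is an illustration only, not the real type: `providerKeys`, `inject`, and `sandboxRequest` are hypothetical, and the hostnames and header choices are assumptions based on the providers' public APIs (an `x-api-key` header for Anthropic, a bearer token for the others). A provider whose credential is not configured has no map entry and is refused.

```go
package main

import "net/http"

// providerKeys maps an upstream host to its host-held credential.
// Only providers whose key is present in the environment get an entry.
type providerKeys map[string]string

// inject discards any sandbox-supplied auth and attaches the credential that
// matches the upstream host. An unconfigured host is refused with 403.
func (p providerKeys) inject(req *http.Request) int {
	req.Header.Del("Authorization")
	req.Header.Del("X-Api-Key")
	host := req.URL.Hostname()
	key, ok := p[host]
	if !ok || key == "" {
		return http.StatusForbidden
	}
	if host == "api.anthropic.com" {
		req.Header.Set("X-Api-Key", key) // assumption: Anthropic-style header
	} else {
		req.Header.Set("Authorization", "Bearer "+key)
	}
	return http.StatusOK
}

// sandboxRequest builds the kind of request a sandbox might send, possibly
// carrying its own (untrusted) Authorization value.
func sandboxRequest(rawURL, auth string) *http.Request {
	req, err := http.NewRequest(http.MethodPost, rawURL, nil)
	if err != nil {
		panic(err)
	}
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	return req
}
```

Because the sandbox-supplied headers are deleted before the lookup, a forged `Authorization` value never reaches an upstream, whether or not the provider is enabled.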

Threat-model review sign-off: reviewed and approved by the maintainer (sole CODEOWNER) on 2026-06-17, via the decision to support per-group providers. Residual risk accepted: enabling a provider adds one egress destination and one host-held credential. This is bounded exactly as the egress broker's risk (§7) is — deny-by-default enablement, full per-request audit, host-only credentials, and a sandbox that remains network-less — so it introduces no boundary crossing that the single-provider proxy did not already define.

11. Credential vault behind the broker

Long-running agents that call real third-party APIs need credentials for them. IronClaw supports request-time credential injection, in which an agent references a credential by LOGICAL NAME (vault://<cred>/<path>) and never holds a key, without weakening either of its two most sensitive components (the sandbox and the egress broker). The design is to integrate a vault, not to build one into the broker: the secret-holding injector is a separate host-side principal the broker forwards TO, never the broker itself.

The secret never enters the sandbox. The agent addresses a credential by name and receives only the upstream response; no plaintext credential is ever in the sandbox's address space — the §1 "host secret" guarantee holds for vaulted credentials exactly as for the model key.
The secret never enters the egress broker — B4-E is unchanged. The broker forwards a vault:// request's own bytes, by name, to the configured host-local injector endpoint, injecting nothing (internal/host/egress/vault.go strips any client Authorization and adds only the logical credential name). The injector — a distinct OS principal — is the sole holder of the credential and the only component that attaches it, host-side. The broker's specification and blast radius (§7) are untouched: it is still "not a credential vault".
Deny by default, gateway-gated policy. "Which agent group may use which credential against which host" is host-side config (internal/host/registry VaultPolicyStore), read-only to the sandbox and mutated only through the gateway's human-approval path — a policy change is a capability change like any other. An unlisted group/credential/host is refused, and the injector endpoint is itself deny-by-default on the broker allowlist.
Audited end-to-end. A host-generated correlation id (internal/host/egress/correlate.go) joins the broker's per-request audit to the injector's injection/policy audit, so a single credential use is traceable across both principals (the §5 Repudiation control, extended). The id is host-authored: a sandbox-supplied value is overwritten, so audit correlation cannot be forged.
Redaction backstop. Even though the broker holds no credential, configured secrets are scrubbed from responses on the broker→sandbox hop (internal/host/egress/redact.go, the model-proxy redaction pattern) so an injected credential can never echo back if an upstream reflects it.
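A rough sketch of the by-name addressing and the deny-by-default policy lookup follows. It is not the VaultPolicyStore implementation: `VaultRef`, `parseVaultRef`, `policyKey`, and `VaultPolicy` are hypothetical, and the sketch assumes the upstream host is supplied separately from the vault:// reference, which this document does not specify.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// VaultRef is a parsed vault://<cred>/<path> reference.
// It names a credential; it never contains one.
type VaultRef struct {
	Cred string
	Path string
}

// parseVaultRef splits a reference into credential name and upstream path.
func parseVaultRef(raw string) (VaultRef, error) {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme != "vault" || u.Host == "" {
		return VaultRef{}, errors.New("not a vault:// reference")
	}
	return VaultRef{Cred: u.Host, Path: u.Path}, nil
}

// policyKey identifies one permitted (group, credential, host) triple.
type policyKey struct{ Group, Cred, Host string }

// VaultPolicy is deny-by-default: any triple that is not listed is refused.
// Entries are expected to be stored with lower-case hosts.
type VaultPolicy map[policyKey]bool

// Allowed reports whether group may use cred against host.
func (p VaultPolicy) Allowed(group, cred, host string) bool {
	return p[policyKey{Group: group, Cred: cred, Host: strings.ToLower(host)}]
}
```

Neither type holds a secret value. In the design above the credential is attached only by the separate injector principal, after a lookup of this kind has succeeded.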

Building a credential vault into the egress broker or the trusted core — making the broker a secret sink — remains a non-goal (§9). The value here comes precisely from keeping injection in a separate, swappable principal: IronClaw's minimal in-tree injector by default, or an operator-vetted external vault (e.g. OneCLI) behind the same broker→injector contract.

Threat-model review sign-off: reviewed and approved by the maintainer (sole CODEOWNER) on 2026-06-17, via the needs-human decision to BUILD the vault-behind-broker integration (credential-vault spike, approved 2026-06-17) and the spike-2 resolution closing its open questions. Residual risk accepted: a vaulted call adds one host-local destination (the injector) and concentrates credentials in that separate principal; this is bounded by deny-by-default per-group policy, end-to-end audit correlation, the response-redaction backstop, and a broker and sandbox that are otherwise exactly as specified. It therefore introduces no boundary crossing beyond the injector principal the design intentionally adds.

12. Skills / extension system

A skill is a host-side, gateway-gated capability bundle: it declares the persona text, the already-compiled tools, the egress hosts, and the read-only assets an agent group should be granted. It is data, not code: a skill never ships a script, interpreter, or post-install hook (the sealed-runtime pillar of §1). This is the deliberate answer to the extensibility offered by peer projects (openclaw's SKILL.md + ClawHub auto-install, nanoclaw's branch-copy), without reintroducing either half of the open + auto-install vector through which those ecosystems were exploited.

The skills boundary

Install is a gateway ChangeRequest, never a sandbox action. Two triggers, one floor. The operator CLI (ironctl skill add) triggers it out of session; an agent can propose it from chat via request_capability_change (kind skill_install, RFC-0006) — the agent NAMES a skill, never authors its content. Either way the host fetches + signature-verifies + validates the named manifest, then synthesizes one ChangePermissions ChangeRequest bundling the declared grants. The change rides the existing verifier chain and the AlwaysRequireHuman floor exactly like any other capability change — never auto-approved. The in-session proposal is fail-closed: a skill that is unknown, unsigned, out-of-policy, or proposed when skills are disabled is refused host-side and never reaches the gateway. The agent may ask, only a human may grant (the create_agent/RFC-0004 posture, B3).
Apply touches config only. On approval the change updates registry / egress allowlist / mount allowlist — it never writes the read-only rootfs or adds an executable. The install payload is structurally incapable of carrying a command, script, or rootfs path (enforced + tested).
Assets are read-only data. A skill's bundled files mount at /skills/<name> with nosuid,nodev,noexec; they are read, never executed.

Trust model for third-party skills

Third-party skill content is untrusted by default — the same posture as an inbound chat message or an egress response (§1):

Curated host source, not an open marketplace. Skills resolve only from a host-configured source (a pinned ref / operator-controlled registry), never an agent-supplied URL. open + auto-install is the vector; IronClaw keeps neither half.
Signature verification before display. A skill whose signature does not verify against the configured trust root is refused at fetch time — it never reaches the approval step (minisign/ed25519, fail-closed).
Manifest validation fails closed. Tools must be a subset of the compiled sandbox registry; egress entries must be bare hostnames (no wildcards); assets must be relative in-bundle paths. Any violation rejects the manifest before a ChangeRequest exists.
Every grant is explicit and human-approved. Because a skill cannot run code, its only damage surface is the capabilities it requests — and each one is named in the manifest and shown to the approver in the change diff. A trojaned skill that quietly asks for egress: evil.example.com is visible and rejected, not discovered post-breach.
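The three manifest rules above (tools must be in the compiled registry, egress entries must be bare hostnames, assets must be relative in-bundle paths) can be sketched as a single fail-closed validator. The `Manifest` shape and the `validate` function are assumptions for illustration; the real manifest format is not shown in this document.

```go
package main

import (
	"errors"
	"path"
	"strings"
)

// Manifest is a hypothetical shape for the grants a skill declares.
type Manifest struct {
	Tools  []string // must already exist in the compiled sandbox registry
	Egress []string // bare hostnames only
	Assets []string // relative paths inside the bundle
}

// validate fails closed: the first violation rejects the whole manifest,
// before any ChangeRequest would be synthesized.
func validate(m Manifest, compiled map[string]bool) error {
	for _, t := range m.Tools {
		if !compiled[t] {
			return errors.New("tool not in compiled registry: " + t)
		}
	}
	for _, h := range m.Egress {
		// Reject wildcards, schemes, ports, paths, userinfo, and spaces.
		if h == "" || strings.ContainsAny(h, "*/:@ ") {
			return errors.New("egress entry is not a bare hostname: " + h)
		}
	}
	for _, a := range m.Assets {
		clean := path.Clean(a)
		if a == "" || path.IsAbs(a) || clean == ".." || strings.HasPrefix(clean, "../") {
			return errors.New("asset path escapes the bundle: " + a)
		}
	}
	return nil
}
```

Cleaning the path before the prefix check is what catches indirect escapes such as `docs/../../secret`, which would otherwise look like an in-bundle path.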

Residual risk

The worst a hostile third-party skill can do is request privileges — which a human sees in the diff and denies — and it can never execute code or self-install. An approved-but-malicious skill is bounded by exactly the runtime controls a hand-configured agent already has: network=none, read-only rootfs, dropped caps, broker-mediated + audited egress, and read-only assets. A skill therefore adds no new boundary crossing beyond the (already-modeled) capability grants in §6/§7; it only makes granting them a reviewable, signed, bundled operation.

Threat-model review sign-off: reviewed and approved by the maintainer (sole CODEOWNER, @omerzamir) on 2026-06-17, via the decision approving the host-side, gateway-gated skills system and explicit maintainer authorization to record this sign-off. Residual risk accepted: a skill can request capability grants, all of which are signed, validated, shown in the change diff, and held for human approval; an approved skill runs under the unchanged sandbox controls (§1, §7) and introduces no new boundary crossing — it cannot execute code or self-install.

13. MCP servers

MCP (Model Context Protocol) servers extend an agent with externally-served tools. The reference design IronClaw hardens treated MCP as a blind approval surface — "approve this server" pulled in whatever tools and reach it chose. IronClaw keeps MCP entirely host-side and gateway-gated, so it adds tool reach without adding a boundary the sandbox can cross. Design + how-to: mcp.md; the one frozen-contract value is RFC-0005 (ChangeMCPAccess) in contract.md.

The MCP boundary (a new trust edge: broker ↔ MCP server)

The sandbox never speaks MCP. It holds no MCP client, no network, and no credentials. Its only MCP endpoint is a per-session unix socket served by the host broker (a plain GET /tools / POST /call shim) — the same network=none posture as the model-proxy and egress sockets (B1). The per-session socket is the trusted session identity, so a sandbox cannot spoof another session's surface with a header.
Access is a gateway ChangeRequest, never a sandbox action. Granting an agent a server + a named tool subset is ChangeMCPAccess. A deterministic MCPServerVerifier rejects a grant for an unconfigured server; the change then rides the AlwaysRequireHuman floor like any other (B5/§6). The human approves a named server AND named tools — never "whatever the server exposes". A sandbox can at most emit such a request via request_capability_change (the create_agent/skills posture, B3).
Deny-by-default at every call. The broker resolves the session's currently approved grant on each tools/list and tools/call and refuses any tool the grant does not name and any tool the server does not declare. A revoked grant stops working immediately. Denials are audited, not silent.
Audited like egress. Every list/call emits a record (session, server, tool, status, bytes, duration) to the structured logs — never arguments or credential values.
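The per-call gating described above might be sketched like this. `Grant`, `Decision`, and `gate` are hypothetical names; the point of the sketch is that a call passes only when the current grant names the tool and the server declares it, and that a missing (revoked) grant refuses everything.

```go
package main

// Grant is a hypothetical record of one human-approved ChangeMCPAccess:
// a named server plus the named tools the approver saw.
type Grant struct {
	Server string
	Tools  map[string]bool
}

// Decision is what the broker would record in its audit log for each call.
type Decision struct {
	Allowed bool
	Reason  string
}

// gate checks one tools/call against the session's current grant and the
// tools the server declares at call time. Both must name the tool.
func gate(g *Grant, server, tool string, declared []string) Decision {
	if g == nil || g.Server != server {
		return Decision{Allowed: false, Reason: "no approved grant for server"}
	}
	if !g.Tools[tool] {
		return Decision{Allowed: false, Reason: "tool not named in grant"}
	}
	for _, d := range declared {
		if d == tool {
			return Decision{Allowed: true, Reason: "ok"}
		}
	}
	return Decision{Allowed: false, Reason: "tool not declared by server"}
}
```

Resolving the grant on every call, instead of caching it per connection, is what makes revocation take effect immediately: once the grant is gone, `gate` receives `nil` and refuses.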

Trust model for the MCP server itself (untrusted by default)

A third-party MCP server is untrusted code, the same posture as an inbound message or an egress response (§1):

Local (stdio) servers are isolated. By default a local server runs in a hardened container (network=none, read-only rootfs, non-root, all caps dropped, no-new-privileges, cgroup caps; optionally gVisor via --mcp-runtime runsc) — never a bare host process. --mcp-isolation none (dev only) opts out and the daemon warns.
Remote servers are TLS-only. A remote endpoint must be https (plain http is allowed only for a loopback host, for local testing), so the host↔server hop is encrypted.
Credentials stay host-side. Server env/headers use ${ENV} references the broker expands at connect time; the catalog stores no raw secret, the API masks values, and nothing crosses to the sandbox. Local-server secrets are forwarded to the container by name (-e KEY), never in the argv (no ps leak). The broker is not a credential vault for the agent — it injects the server's own configured auth toward the server, never launders a host secret back to the sandbox (the B4-E posture).
Operator-configured catalog, not an agent-supplied URL. A sandbox can only name a server an operator already configured; it can never point the broker at an arbitrary endpoint.
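The credential handling above (references expanded at connect time, secrets passed to a container by name) can be illustrated with two small helpers. Both are hypothetical: `expandRef` uses the standard library's `os.Expand`, which also accepts the bare `$NAME` form, and `envArgs` only builds the `-e KEY` flag pairs mentioned in the text.

```go
package main

import "os"

// expandRef resolves ${NAME} references in a configured header or env value.
// It reports false if any referenced name is missing, so the caller can
// refuse to connect instead of sending an empty credential.
func expandRef(tmpl string, lookup func(string) (string, bool)) (string, bool) {
	resolved := true
	out := os.Expand(tmpl, func(name string) string {
		v, ok := lookup(name)
		if !ok {
			resolved = false
		}
		return v
	})
	return out, resolved
}

// envArgs builds container flags that pass secrets by name only ("-e KEY").
// The value is inherited from the launching process's environment, so it
// never appears in the argument list.
func envArgs(names []string) []string {
	args := make([]string, 0, 2*len(names))
	for _, n := range names {
		args = append(args, "-e", n)
	}
	return args
}
```

In a real broker the lookup function would typically be `os.LookupEnv`; it is a parameter here so the behaviour can be exercised without touching the process environment.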

Residual risk

The worst a hostile MCP server can do is misbehave within the tools an agent was explicitly granted with human approval, under the broker's deny-by-default gating and audit, and (for a local server) inside a network=none hardened container. It cannot reach the host network from inside that container, cannot obtain a host credential it was not configured with, and cannot widen its own tool surface (that needs another human-approved grant). MCP therefore adds a new outbound trust edge (host → server) that is isolated, TLS'd, and audited, but it introduces no new inbound boundary crossing into the host or the sandbox beyond the capability grants already modeled in §6.