Sandbox backends¶
Bernstein isolates every spawned agent in a sandbox so multiple agents running against the same repository cannot stomp on each other's files, processes, or secrets. Historically the only sandbox type was a local git worktree. As of oai-002 the choice of sandbox is pluggable — agents can run inside worktrees, Docker containers, E2B microVMs, Modal sandboxes, or any backend a plugin author registers.
This document covers:
- The SandboxBackend / SandboxSession protocol and the WorkspaceManifest / SandboxCapability value objects
- The four first-party backends (worktree, docker, e2b, modal)
- The bernstein.sandbox_backends entry-point group for third-party backends
- The phased rollout plan (phase 1 lands the protocol and backends; phase 2, tracked as oai-002b, refactors the spawner to route adapter exec through SandboxSession)
Protocol shape¶
The protocol lives in src/bernstein/core/sandbox/:
from bernstein.core.sandbox import (
    SandboxBackend,
    SandboxSession,
    SandboxCapability,
    WorkspaceManifest,
    GitRepoEntry,
    FileEntry,
    ExecResult,
    get_backend,
    list_backends,
    register_backend,
)
SandboxBackend¶
A runtime_checkable Protocol. Every backend exposes:
- name: str. Canonical identifier referenced from plan.yaml.
- capabilities: frozenset[SandboxCapability]. Feature flags.
- async def create(manifest, options=None) -> SandboxSession. Provisions a fresh sandbox.
- async def resume(snapshot_id) -> SandboxSession. Restores a snapshot; raises NotImplementedError if the backend does not declare SandboxCapability.SNAPSHOT.
- async def destroy(session) -> None. Tears down a session.
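Because the protocol is runtime_checkable, a plain isinstance() call can confirm that an object has the right shape. The sketch below is illustrative only: it declares a local stand-in for SandboxBackend that mirrors the members listed above, and NullBackend is a hypothetical class, not part of Bernstein.

```python
import asyncio
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class SandboxBackend(Protocol):
    """Local stand-in mirroring the listed members (not the real class)."""

    name: str
    capabilities: frozenset

    async def create(self, manifest: Any, options: Optional[dict] = None) -> Any: ...
    async def resume(self, snapshot_id: str) -> Any: ...
    async def destroy(self, session: Any) -> None: ...


class NullBackend:
    """Hypothetical backend that provisions nothing; it only has the right shape."""

    name = "null"
    capabilities = frozenset({"file_rw"})  # assumption: strings stand in for the enum

    async def create(self, manifest, options=None):
        return {"manifest": manifest, "options": options or {}}

    async def resume(self, snapshot_id):
        # No SNAPSHOT capability is declared, so resume is unsupported.
        raise NotImplementedError("null backend cannot resume snapshots")

    async def destroy(self, session):
        session.clear()


backend = NullBackend()
# isinstance() only checks that the members exist, not their signatures.
print(isinstance(backend, SandboxBackend))   # True
print(isinstance(object(), SandboxBackend))  # False

session = asyncio.run(backend.create({"root": "/workspace"}))
asyncio.run(backend.destroy(session))
```

The structural check is shallow, so it complements a behavioural test suite rather than replacing one.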
SandboxSession¶
An ABC with six abstract methods:
- read(path) -> bytes
- write(path, data, *, mode=0o644) -> None
- exec(cmd, *, cwd=None, env=None, timeout=None, stdin=None) -> ExecResult
- ls(path) -> list[str]
- snapshot() -> str (SNAPSHOT-capable backends only)
- shutdown() -> None (idempotent)
ExecResult is a frozen dataclass with exit_code, stdout, stderr, and duration_seconds.
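To make the session contract concrete, here is a minimal sketch of a session backed by a plain host directory. It is not the real worktree backend. LocalDirSession is a hypothetical name, the methods are written as ordinary synchronous functions (the text above does not say whether the real ones are coroutines), and stdout / stderr are assumed to be decoded text.

```python
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExecResult:
    exit_code: int
    stdout: str  # assumption: decoded text; the real field may hold bytes
    stderr: str
    duration_seconds: float


class LocalDirSession:
    """Hypothetical session rooted at a host directory (no isolation at all)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.closed = False

    def _resolve(self, path: str) -> Path:
        # Treat every path as relative to the session root.
        # This sketch does not guard against ".." escapes.
        return self.root / path.lstrip("/")

    def read(self, path: str) -> bytes:
        return self._resolve(path).read_bytes()

    def write(self, path: str, data: bytes, *, mode: int = 0o644) -> None:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        target.chmod(mode)

    def exec(self, cmd, *, cwd=None, env=None, timeout=None, stdin=None) -> ExecResult:
        started = time.monotonic()
        proc = subprocess.run(
            list(cmd),  # argv-based: no shell is involved
            cwd=self._resolve(cwd) if cwd else self.root,
            env=dict(env) if env is not None else None,
            timeout=timeout,
            input=stdin,
            capture_output=True,
        )
        return ExecResult(
            exit_code=proc.returncode,
            stdout=proc.stdout.decode(errors="replace"),
            stderr=proc.stderr.decode(errors="replace"),
            duration_seconds=time.monotonic() - started,
        )

    def ls(self, path: str) -> list:
        return sorted(p.name for p in self._resolve(path).iterdir())

    def snapshot(self) -> str:
        raise NotImplementedError("this sketch is not SNAPSHOT-capable")

    def shutdown(self) -> None:
        self.closed = True  # safe to call repeatedly


with tempfile.TemporaryDirectory() as tmp:
    session = LocalDirSession(Path(tmp))
    session.write("src/hello.txt", b"hi")
    result = session.exec([sys.executable, "-c", "print(open('src/hello.txt').read())"])
    print(result.exit_code, result.stdout.strip())  # 0 hi
    session.shutdown()
    session.shutdown()  # second call is a no-op
```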
SandboxCapability¶
A StrEnum with six values: FILE_RW, EXEC, NETWORK, GPU, SNAPSHOT, PERSISTENT_VOLUMES. Every backend advertises the set it supports; schedulers reject manifests requiring capabilities the selected backend does not expose.
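The rejection rule amounts to a set difference between what a task requires and what the backend advertises. The following sketch assumes lowercase string values for the enum members and uses a hypothetical MissingCapabilityError; the real scheduler may report the failure differently.

```python
from enum import Enum


class SandboxCapability(str, Enum):
    """Stand-in for the real StrEnum; the string values are assumptions."""

    FILE_RW = "file_rw"
    EXEC = "exec"
    NETWORK = "network"
    GPU = "gpu"
    SNAPSHOT = "snapshot"
    PERSISTENT_VOLUMES = "persistent_volumes"


class MissingCapabilityError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def check_capabilities(backend_name, offered, required):
    """Raise if the backend does not advertise everything that is required."""
    missing = frozenset(required) - frozenset(offered)
    if missing:
        names = ", ".join(sorted(cap.name for cap in missing))
        raise MissingCapabilityError(f"backend {backend_name!r} lacks: {names}")


worktree_caps = {
    SandboxCapability.FILE_RW,
    SandboxCapability.EXEC,
    SandboxCapability.NETWORK,
}

check_capabilities("worktree", worktree_caps, {SandboxCapability.EXEC})  # passes
try:
    check_capabilities("worktree", worktree_caps, {SandboxCapability.SNAPSHOT})
except MissingCapabilityError as exc:
    print(exc)  # backend 'worktree' lacks: SNAPSHOT
```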
WorkspaceManifest¶
Immutable value object passed to SandboxBackend.create:
@dataclass(frozen=True)
class WorkspaceManifest:
    root: str = "/workspace"
    repo: GitRepoEntry | None = None
    files: tuple[FileEntry, ...] = ()
    env: Mapping[str, str] = field(default_factory=dict)
    timeout_seconds: int = 1800
GitRepoEntry and FileEntry are companion frozen dataclasses. Cloud-specific mount entries (S3, persistent volumes, secrets manager bindings) are intentionally deferred to oai-003.
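Since the manifest is frozen, a variant is derived with dataclasses.replace rather than by assignment. The sketch below redeclares WorkspaceManifest locally so it runs on its own; the fields of GitRepoEntry (url, ref) and FileEntry (path, content) are hypothetical, because the text above does not list them.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class GitRepoEntry:
    url: str           # hypothetical field
    ref: str = "main"  # hypothetical field


@dataclass(frozen=True)
class FileEntry:
    path: str             # hypothetical field
    content: bytes = b""  # hypothetical field


@dataclass(frozen=True)
class WorkspaceManifest:
    root: str = "/workspace"
    repo: Optional[GitRepoEntry] = None
    files: Tuple[FileEntry, ...] = ()
    env: Mapping[str, str] = field(default_factory=dict)
    timeout_seconds: int = 1800


base = WorkspaceManifest(
    repo=GitRepoEntry(url="https://example.com/repo.git"),
    files=(FileEntry("notes.txt", b"hello"),),
    env={"CI": "1"},
)

# Derive a shorter-lived variant without touching the original.
short = replace(base, timeout_seconds=300)
print(base.timeout_seconds, short.timeout_seconds)  # 1800 300

try:
    base.timeout_seconds = 60  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("manifests are immutable; use dataclasses.replace instead")
```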
First-party backends¶
| Backend | Ships in | Capabilities | Notes |
|---|---|---|---|
| worktree | core | FILE_RW, EXEC, NETWORK | Wraps the existing WorktreeManager. Zero behaviour change. Default. |
| docker | core | FILE_RW, EXEC, NETWORK | Launches a container per session via the docker Python SDK. Needs pip install bernstein[docker]. |
| e2b | [e2b] extra | FILE_RW, EXEC, NETWORK, SNAPSHOT | Runs in E2B Firecracker microVMs. Needs pip install bernstein[e2b] plus E2B_API_KEY. |
| modal | [modal] extra | FILE_RW, EXEC, NETWORK, SNAPSHOT, GPU | Serverless containers with optional GPU. Needs pip install bernstein[modal] plus MODAL_TOKEN_ID / MODAL_TOKEN_SECRET. |
Trade-offs¶
- Latency. worktree has no provisioning cost; docker adds a one-time pull plus ≤ 2 s container start; e2b / modal add 1–3 s of cold start per session plus provider-side overhead.
- Cost. worktree and docker are free (local compute). e2b bills by sandbox minute. modal bills by compute seconds, with optional GPU surcharges.
- Isolation. worktree shares the host filesystem and network; docker provides cgroup + namespace isolation but shares the kernel; e2b runs in a fresh Firecracker microVM per session; modal runs in dedicated serverless containers.
- Capabilities. Only e2b and modal support snapshot/resume; only modal exposes GPU today.
- Supported exec semantics. All four backends handle argv-based exec with exit-code, stdout, and stderr capture. docker does not support stdin injection in phase 1; that is tracked in oai-002b.
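The capability column of the first-party table can be treated as data when picking a backend. This is a small sketch under that assumption: the FIRST_PARTY mapping is transcribed from the table, and backends_supporting is a hypothetical helper, not a Bernstein API.

```python
# Capability sets transcribed from the first-party backends table.
FIRST_PARTY = {
    "worktree": frozenset({"FILE_RW", "EXEC", "NETWORK"}),
    "docker": frozenset({"FILE_RW", "EXEC", "NETWORK"}),
    "e2b": frozenset({"FILE_RW", "EXEC", "NETWORK", "SNAPSHOT"}),
    "modal": frozenset({"FILE_RW", "EXEC", "NETWORK", "SNAPSHOT", "GPU"}),
}


def backends_supporting(required):
    """Return the backends whose capability set covers every requirement."""
    needed = frozenset(required)
    return [name for name, caps in FIRST_PARTY.items() if needed <= caps]


print(backends_supporting({"EXEC"}))             # ['worktree', 'docker', 'e2b', 'modal']
print(backends_supporting({"SNAPSHOT"}))         # ['e2b', 'modal']
print(backends_supporting({"SNAPSHOT", "GPU"}))  # ['modal']
```

Capabilities only narrow the candidates; latency, cost, and isolation still have to be weighed by hand.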
plan.yaml extension¶
stages:
  - name: risky-execution
    sandbox:
      backend: docker  # worktree (default), docker, e2b, modal, or a plugin name
      options:
        image: python:3.13-slim
        memory_mb: 2048
      timeout_seconds: 1800
    steps:
      - title: "Run untrusted code analysis"
        role: security
        cli: claude
The sandbox: block is entirely optional. When it is omitted, the stage runs in the worktree backend, which is byte-identical to pre-oai-002 behaviour.
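The defaulting rule can be sketched as a small function over an already-parsed stage. The example assumes the YAML has been loaded into plain dictionaries; resolve_sandbox is a hypothetical helper name, not the loader Bernstein actually uses.

```python
from typing import Any, Dict, Tuple

DEFAULT_BACKEND = "worktree"


def resolve_sandbox(stage: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (backend name, options) for a stage, applying the default."""
    block = stage.get("sandbox") or {}  # a missing or empty block means defaults
    backend = block.get("backend", DEFAULT_BACKEND)
    options = dict(block.get("options") or {})
    return backend, options


legacy_stage = {"name": "build", "steps": []}
risky_stage = {
    "name": "risky-execution",
    "sandbox": {
        "backend": "docker",
        "options": {"image": "python:3.13-slim", "memory_mb": 2048},
    },
}

print(resolve_sandbox(legacy_stage))  # ('worktree', {})
print(resolve_sandbox(risky_stage))   # ('docker', {'image': 'python:3.13-slim', 'memory_mb': 2048})
```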
Registering a custom backend¶
Plugin authors declare an entry point in their own pyproject.toml:
[project.entry-points."bernstein.sandbox_backends"]
mybackend = "my_package.sandbox:MySandboxBackend"
On next process start the registry picks the entry up automatically. bernstein agents sandbox-backends lists every installed backend with its capability set so operators can verify registration.
Third-party backends must:
- Provide name and capabilities class attributes.
- Implement create, resume, and destroy as coroutines.
- Pass the conformance suite at bernstein.core.sandbox.conformance.SandboxBackendConformance.
- Import provider SDKs lazily (inside methods or behind TYPE_CHECKING) so importing the backend module never crashes on a missing SDK.
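A skeleton that follows these requirements might look like the sketch below. Everything provider-specific is hypothetical: my_provider_sdk and its start() function do not exist, and plain strings stand in for SandboxCapability members so the example runs without Bernstein installed.

```python
import asyncio
import importlib


class MySandboxBackend:
    name = "mybackend"
    # Assumption: strings stand in for SandboxCapability members.
    capabilities = frozenset({"file_rw", "exec"})

    async def create(self, manifest, options=None):
        # Lazy import: the SDK is only needed once a sandbox is requested,
        # so importing this module never fails on a missing dependency.
        try:
            sdk = importlib.import_module("my_provider_sdk")  # hypothetical SDK
        except ImportError as exc:
            raise RuntimeError("mybackend requires the my_provider_sdk package") from exc
        return sdk.start(manifest, **(options or {}))  # hypothetical call

    async def resume(self, snapshot_id):
        # SNAPSHOT is not declared in capabilities, so resume is unsupported.
        raise NotImplementedError("mybackend does not support snapshots")

    async def destroy(self, session):
        session.shutdown()


backend = MySandboxBackend()
try:
    asyncio.run(backend.create({"root": "/workspace"}))
except RuntimeError as exc:
    print(exc)  # reached here because the hypothetical SDK is not installed
```

Deferring the import to create() means listing backends stays cheap and safe even on machines that lack the provider SDK.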
Phased rollout¶
Phase 1 (this ticket, oai-002)¶
- SandboxBackend / SandboxSession / SandboxCapability / WorkspaceManifest land in src/bernstein/core/sandbox/.
- Four first-party backends ship (worktree and docker in core; e2b and modal as optional extras).
- AgentSpawner gains an optional sandbox_session parameter. When None it falls back to the existing direct-worktree path. All 35 adapters continue to run unchanged.
- bernstein agents sandbox-backends lists installed backends.
- plan.yaml accepts an optional sandbox: block per stage.
Phase 2 (follow-up, oai-002b)¶
- AgentSpawner routes adapter exec through SandboxSession.exec, so the selected backend controls where subprocesses actually run.
- Adapters are refactored one-by-one to use session.read / session.write for file I/O instead of direct Path.write_bytes calls on the worktree directory.
- Cost metrics and WAL events gain backend-aware labels.
Phase 2 is a mechanical but widespread refactor across 35 adapters; keeping it out of phase 1 lets the protocol land independently.
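The shape of the per-adapter change can be illustrated with a helper that writes through a session when one is supplied and otherwise keeps the direct worktree write. This is a sketch of the pattern only: write_artifact and RecordingSession are hypothetical names, and the real adapters may structure the fallback differently.

```python
import tempfile
from pathlib import Path
from typing import Optional


class RecordingSession:
    """Hypothetical stand-in for a SandboxSession; it only records writes."""

    def __init__(self) -> None:
        self.files = {}

    def write(self, path: str, data: bytes, *, mode: int = 0o644) -> None:
        self.files[path] = data


def write_artifact(worktree: Path, rel_path: str, data: bytes,
                   session: Optional[RecordingSession] = None) -> None:
    if session is None:
        # Phase 1 behaviour: write straight into the worktree directory.
        target = worktree / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    else:
        # Phase 2 behaviour: the selected backend decides where the bytes land.
        session.write(rel_path, data)


with tempfile.TemporaryDirectory() as tmp:
    write_artifact(Path(tmp), "out/report.md", b"# direct")
    print((Path(tmp) / "out" / "report.md").read_bytes())  # b'# direct'

    recorded = RecordingSession()
    write_artifact(Path(tmp), "out/report.md", b"# via session", session=recorded)
    print(recorded.files)  # {'out/report.md': b'# via session'}
```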
Observability (phase 1 scaffolding)¶
Each backend create/destroy cycle should emit WAL + Prometheus metrics:
- sandbox_session_created{backend=..., session_id=...}
- sandbox_session_destroyed{backend=..., duration_seconds=...}
- sandbox_exec_count{backend=..., exit_code=...}
Wiring into the existing metrics/WAL subsystems is part of oai-002b; phase 1 only exposes the interfaces.
Conformance¶
SandboxBackendConformance (in src/bernstein/core/sandbox/conformance.py) is a parametrised pytest class that any backend can subclass to get full protocol test coverage. Backends declaring SandboxCapability.SNAPSHOT additionally get the snapshot/resume round-trip test automatically.
The worktree backend runs the conformance suite in unit tests (tests/unit/sandbox/test_backend_protocol.py). Docker / E2B / Modal conformance lives under tests/integration/sandbox/; those tests auto-skip without a live daemon or provider credentials.
Security considerations¶
- worktree does not isolate at the kernel level. If you need to run untrusted code you must choose a sandboxed backend.
- docker should be run with network_disabled=True for untrusted workloads; the default leaves network enabled because most agent tasks legitimately need outbound HTTP.
- e2b and modal run untrusted code by design; their isolation posture is the provider's responsibility.
- Snapshot IDs are opaque to callers but may contain sensitive state. Do not log them at INFO level without redaction.
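One possible way to honour the snapshot-ID guidance is to log a short digest instead of the raw value. The helper below is a hypothetical sketch, not part of Bernstein; a truncated SHA-256 still lets operators correlate log lines for the same snapshot without exposing the ID itself.

```python
import hashlib
import logging


def redact_snapshot_id(snapshot_id: str) -> str:
    """Return a stable, non-reversible tag that is safe to log."""
    digest = hashlib.sha256(snapshot_id.encode("utf-8")).hexdigest()
    return f"snapshot:{digest[:12]}"


logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sandbox")

raw_id = "snap-raw-provider-value"  # made-up example value
log.info("resuming %s", redact_snapshot_id(raw_id))
```

If snapshot IDs turn out to be short or guessable, a keyed hash (for example hmac with a deployment secret) would be a safer choice than a bare digest.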