Disaster recovery
Audience: SREs planning a DR runbook for a Bernstein-managed project — backups, restore drills, RPO/RTO targets, and the forward-looking cross-region replication plan.
What: `bernstein dr` produces and consumes tarballs (optionally encrypted) of the durable `.sdd/` state. Combined with the WAL and checkpoint subsystem, that backup is sufficient to bring a replacement orchestrator back to a recent consistent state without rebuilding from a git remote.
Why: A Bernstein workspace is the source of truth for in-flight backlog, claim ownership, audit chain, WAL hashes, cost ledger, and cascade-router metrics. A lost workspace is not just lost code — it's lost determinism and lost audit. Cross-link: State persistence for the durability model that makes restore safe.
Bernstein's recovery model (5-second mental model)
Two complementary mechanisms:
- Write-Ahead Log (WAL): every orchestrator decision (claim, spawn, complete, fail, merge) is appended to a hash-chained JSONL file with `fsync()` before the action runs. On startup, `wal_replay` finds entries marked uncommitted and re-executes them after consulting the idempotency store, so a crashed orchestrator wakes up consistent. Source: `src/bernstein/core/persistence/wal.py:1-15`, `src/bernstein/core/persistence/wal_replay.py:1-18`.
- Checkpointing: periodic atomic snapshots of full orchestrator state (task graph + agent sessions + cost accumulator + WAL sequence position) at `.sdd/runtime/checkpoints/checkpoint-<id>.json`, plus operator-visible "where we are" rows at `.sdd/sessions/<ts>-checkpoint.json`. Source: `src/bernstein/core/persistence/checkpoint.py:1-19`.
Backups bundle both, plus everything else listed in §"Backup contents" below.
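To make the "hash-chained, fsync before acting" idea concrete, here is a minimal sketch. It is not Bernstein's `wal.py`: the record fields (`prev_hash`, `hash`, `payload`), the all-zero genesis value, and the function names are assumptions chosen for illustration.

```python
import hashlib
import json
import os
import tempfile

GENESIS = "0" * 64  # assumed sentinel used as prev_hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(path: str, payload: dict, prev_hash: str) -> str:
    """Append one JSONL record and fsync it before returning to the caller."""
    digest = entry_hash(prev_hash, payload)
    record = {"prev_hash": prev_hash, "hash": digest, "payload": payload}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # durable on disk before the action may run
    return digest


def verify_chain(path: str) -> bool:
    """Recompute every link; return False at the first mismatch."""
    prev = GENESIS
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record["prev_hash"] != prev:
                return False
            if entry_hash(prev, record["payload"]) != record["hash"]:
                return False
            prev = record["hash"]
    return True


def demo() -> tuple[bool, bool]:
    """Write three entries, verify, tamper with the middle one, verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        wal = os.path.join(tmp, "run.wal.jsonl")
        prev = GENESIS
        for action in ("claim", "spawn", "complete"):
            prev = append_entry(wal, {"action": action}, prev)
        intact = verify_chain(wal)
        with open(wal, encoding="utf-8") as fh:
            lines = fh.readlines()
        lines[1] = lines[1].replace("spawn", "merge")  # simulate corruption
        with open(wal, "w", encoding="utf-8") as fh:
            fh.writelines(lines)
        return intact, verify_chain(wal)
```

Because each hash covers the previous one, editing any entry invalidates that entry and breaks the chain from that point on, which is why a restored workspace with a damaged WAL can be detected rather than silently replayed.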
bernstein dr group
Source: src/bernstein/cli/commands/disaster_recovery_cmd.py.
dr backup --to <path>
Bundle the durable subset of .sdd/ into a gzipped tarball. Optional symmetric encryption via Fernet (PBKDF2-SHA256, 600k iterations).
```console
$ bernstein dr backup --to ./bernstein-backup-2026-05-04.tar.gz
Backing up .sdd to ./bernstein-backup-2026-05-04.tar.gz...
Backup complete!
Path: ./bernstein-backup-2026-05-04.tar.gz
Size: 18437298 bytes
Files: 8421
SHA256: 2f9a17b8c4e3d501...
```
Encrypted (recommended for off-site copy): add `--encrypt` and `--password <str>` to the same command; the output path gains a `.enc` suffix.
Flags:
- `--to <path>`: required destination. Encrypted output gets a `.enc` suffix appended automatically.
- `--encrypt`: wraps the tarball in Fernet ciphertext. Requires `--password` (a missing password is rejected because the random key would be unrecoverable; see `src/bernstein/core/persistence/disaster_recovery.py:200-201`).
- `--password <str>`: passphrase fed through PBKDF2 with a fresh 16-byte salt prepended to the ciphertext.
- `--sdd <path>`: override the source `.sdd/` directory (defaults to `./.sdd`).
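The passphrase handling described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `disaster_recovery.py`: the function names and the exact envelope layout (salt immediately followed by ciphertext) are hypothetical, and the Fernet encryption step itself is left out because it needs the third-party `cryptography` package.

```python
import base64
import hashlib
import os

SALT_LEN = 16          # salt size stated in the flag description
ITERATIONS = 600_000   # PBKDF2-SHA256 iteration count stated in the text


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive 32 bytes with PBKDF2-HMAC-SHA256, encoded as URL-safe base64.

    A Fernet key is exactly 32 random bytes in URL-safe base64, so the
    result has the shape such a key needs.
    """
    raw = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=32
    )
    return base64.urlsafe_b64encode(raw)


def new_salt() -> bytes:
    """Fresh random salt for each backup."""
    return os.urandom(SALT_LEN)


def pack(salt: bytes, ciphertext: bytes) -> bytes:
    """Prepend the salt so the restore side can re-derive the key."""
    return salt + ciphertext


def unpack(blob: bytes) -> tuple[bytes, bytes]:
    """Split an envelope back into (salt, ciphertext)."""
    if len(blob) < SALT_LEN:
        raise ValueError("blob is shorter than the salt prefix")
    return blob[:SALT_LEN], blob[SALT_LEN:]
```

With `cryptography` installed, the derived key would typically be passed to `cryptography.fernet.Fernet(key)`. The salt is not secret; storing it next to the ciphertext is what lets a restore re-derive the same key from the passphrase alone, which is also why a lost passphrase makes the backup unrecoverable.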
dr restore --from <path>
Re-hydrate .sdd/ from a backup. --dry-run lists the contained files without extracting; use it as a verify step before pointing it at a destination directory.
```console
$ bernstein dr restore --from ./bk.tar.gz --dry-run
Dry run — listing contents of ./bk.tar.gz:
Files: 8421
Source: ./bk.tar.gz
SHA256: 2f9a17b8c4e3d501...
Files in backup:
manifest.json
backlog/open/task-...
runtime/wal/run-2026-05-04.wal.jsonl
...
```
Encrypted backup: add `--decrypt` and `--password <str>`, and point `--from` at the `.tar.gz.enc` file.
Flags:
- `--from <path>`: required source (`.tar.gz` or `.tar.gz.enc`).
- `--decrypt` + `--password`: only when the backup was encrypted.
- `--dry-run`: list contents and report SHA256, no writes.
- `--sdd <path>`: destination override.
Verify
There is no dedicated dr verify subcommand today. The supported drill is:
- `bernstein dr backup --to ./drill.tar.gz`
- Move to a scratch dir.
- `bernstein dr restore --from ./drill.tar.gz --dry-run`
- Confirm `Files:` matches the backup's printed `file_count`.
- Optionally `bernstein dr restore --from ./drill.tar.gz --sdd /tmp/.sdd-restored`, then `bernstein doctor --workspace /tmp` to revalidate.
For ad-hoc integrity checks, the WAL itself is hash-chained — running the orchestrator against a restored workspace will fail loudly if the chain is broken (src/bernstein/core/persistence/wal.py:36-38).
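If you want an independent cross-check of the numbers the drill compares, the file count and SHA256 of an unencrypted tarball can be recomputed with the standard library. The sketch below is not part of Bernstein; `summarize_backup` and `make_sample_backup` are hypothetical helpers, and it assumes the CLI's `Files:` figure counts regular files only.

```python
import hashlib
import io
import os
import tarfile
import tempfile


def summarize_backup(path: str) -> tuple[int, str]:
    """Return (regular-file count, SHA256 hex digest) for a .tar.gz file."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so large backups do not load into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    with tarfile.open(path, "r:gz") as tar:
        count = sum(1 for member in tar if member.isfile())
    return count, sha.hexdigest()


def make_sample_backup(path: str, files: dict[str, bytes]) -> None:
    """Write a tiny gzipped tarball for the demo; not a real Bernstein backup."""
    with tarfile.open(path, "w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def demo() -> tuple[int, int, bool]:
    """Summarize a two-file sample twice and confirm the result is stable."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "drill.tar.gz")
        make_sample_backup(
            path,
            {"manifest.json": b"{}", "runtime/wal/run.wal.jsonl": b"{}\n"},
        )
        count, digest = summarize_backup(path)
        return count, len(digest), summarize_backup(path) == (count, digest)
```

Comparing the digest recorded at backup time with one recomputed on the restore host catches transfer corruption before any extraction happens.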
Backup contents
Defined in src/bernstein/core/persistence/disaster_recovery.py:46-123.
Included (_BACKUP_DIRS):
| Path | Why |
|---|---|
| backlog/{open,done,closed,deferred,manual} | Task claim state and history |
| metrics | Cascade-router bandit history, SLO budgets |
| traces | Distributed traces |
| memory | Persistent memory store |
| sessions | Operator-visible checkpoints |
| decisions | ADR-style decision logs |
| docs, config | In-tree docs and resolved config |
| archive, agents, index | Agent registry + indexes |
| caching, models | Cache state |
| audit | HMAC-chained audit log |
| runs | Per-run reports |
| runtime/ | WAL, file locks, sessions, team state, task graph |
Excluded (_EXCLUDE_DIRS + _EXCLUDE_PATTERNS): rotated logs, worktrees (regenerable), debug dumps, research caches, in-flight signals, PID files, heartbeats, kill markers, runtime/*.log, runtime/*.pid, access.jsonl*, retrospective.md, summary.md. The skip-list is the boundary between "warm restart" and "everything you can rebuild on the fly".
The tarball's root contains a manifest.json with the inclusion lists, exclusion patterns, and created_at epoch (src/bernstein/core/persistence/disaster_recovery.py:207-223).
Restore procedure (step-by-step)
For a complete loss of the orchestrator host:
- Provision a fresh node with the same Bernstein version. Mismatched versions can run, but tail your `bernstein doctor` output for compatibility warnings.
- Copy the latest backup tarball to the new host (encrypted on the wire, because these tarballs contain audit secrets and credential vault blobs).
- Decrypt + extract: `bernstein dr restore --from <path> --decrypt --password <str>`.
- Restore secrets: the credential vault is included (`runtime/`), but provider API keys live in environment variables (see env-isolation.md). Re-export those out of band.
- Start the orchestrator: `bernstein start`. WAL replay runs automatically: every uncommitted entry from the previous instance replays through the idempotency store, so spawned-but-not-completed tasks finish correctly without duplicate side effects (`src/bernstein/core/persistence/wal_replay.py:1-18`).
- Verify:
  - `bernstein status`: task counts match pre-incident.
  - `bernstein audit verify`: hash chain intact.
  - `bernstein dr backup --to /tmp/drill.tar.gz`: a fresh backup succeeds (sanity).
- Resume external triggers: if any cron/CI/webhook was paused during failover, re-enable it now.
For a partial loss (workspace corruption with the host alive):
- `bernstein stop` to drain.
- `mv .sdd .sdd.broken && bernstein dr restore --from /backup/...`.
- `bernstein start`. WAL replay handles inconsistencies.
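The replay step that both procedures rely on can be sketched as follows. This is a simplified model, not the pipeline in `wal_replay.py`: the in-memory `IdempotencyStore`, the entry fields (`key`, `action`, `committed`), and the `replay` signature are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotencyStore:
    """In-memory stand-in that remembers which entry keys already took effect."""

    done: set = field(default_factory=set)

    def already_applied(self, key: str) -> bool:
        return key in self.done

    def mark_applied(self, key: str) -> None:
        self.done.add(key)


def replay(entries: list[dict], store: IdempotencyStore, execute) -> list[str]:
    """Re-execute uncommitted entries at most once; return the keys that ran."""
    ran = []
    for entry in entries:
        if entry["committed"]:
            continue  # finished before the crash, nothing to do
        key = entry["key"]
        if store.already_applied(key):
            # The side effect happened; only the commit mark was lost.
            entry["committed"] = True
            continue
        execute(entry)
        store.mark_applied(key)
        entry["committed"] = True
        ran.append(key)
    return ran


def demo() -> tuple[list[str], list[str], list[str]]:
    """Replay twice over the same log and record the side effects."""
    effects: list[str] = []
    store = IdempotencyStore(done={"task-2:spawn"})
    entries = [
        {"key": "task-1:claim", "action": "claim", "committed": True},
        {"key": "task-2:spawn", "action": "spawn", "committed": False},
        {"key": "task-3:complete", "action": "complete", "committed": False},
    ]
    first = replay(entries, store, lambda e: effects.append(e["action"]))
    second = replay(entries, store, lambda e: effects.append(e["action"]))
    return first, second, effects
```

In the demo only `task-3:complete` runs: `task-1` was committed, and `task-2` had already taken effect according to the store. A second replay over the same log does nothing, which is the property that makes restarting after a restore safe.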
RPO / RTO expectations
| Metric | Target | Notes |
|---|---|---|
| RPO (Recovery Point Objective) | = backup cadence | Operator-set. Default cron: hourly snapshots in production, 6 h elsewhere. |
| RTO (Recovery Time Objective) | < 15 min for the dr-restore step itself, plus your provisioning | Restore is a tar -xzf plus fsync — IO-bound, not CPU-bound. |
WAL fsync per entry guarantees zero loss of committed state across a process crash: every decision is durable on local disk before it executes. After a host loss, the RPO gap is the time between your last `dr backup` and the incident, because entries written after that backup exist only on the lost disk. The WAL itself is included in the backup, so a restored workspace replays the uncommitted entries it captured; within that window, only the bounded, un-fsynced runtime telemetry (heartbeats, log tails, metric counts) is missing. The cost ledger and audit chain are preserved as of the backup.
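As a worked example of "RPO = backup cadence", the sketch below computes the data-loss window for an incident and checks it against a target. The function names and timestamps are made up for illustration; nothing here is a Bernstein API.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Time span whose writes a restore from last_backup cannot recover."""
    return incident - last_backup


def rpo_breached(last_backup: datetime, incident: datetime, target: timedelta) -> bool:
    """True when the loss window exceeds the RPO target."""
    return data_loss_window(last_backup, incident) > target


def demo() -> tuple[timedelta, bool, bool]:
    """Hourly cadence: one incident inside the window, one after a missed run."""
    target = timedelta(hours=1)
    last_backup = datetime(2026, 5, 4, 10, 0, tzinfo=timezone.utc)
    on_schedule = datetime(2026, 5, 4, 10, 40, tzinfo=timezone.utc)
    missed_run = datetime(2026, 5, 4, 11, 20, tzinfo=timezone.utc)
    return (
        data_loss_window(last_backup, on_schedule),
        rpo_breached(last_backup, on_schedule, target),
        rpo_breached(last_backup, missed_run, target),
    )
```

With an hourly schedule, an incident 40 minutes after the last backup stays inside the target, while a single missed cron run pushes the window to 80 minutes and breaches it. Alerting on backup age is therefore as important as scheduling the backup.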
Cross-region considerations
Multi-region replication of the WAL is draft / not shipped as of 2026-05-04. The design lives in dev/specs/internal-workflows/WORKFLOW-disaster-recovery-cross-region.md (ENT-010, status: Draft) and is partially scaffolded in src/bernstein/core/persistence/wal_replication.py:1-12 (pull-based follower model, LEADER_ONLY / QUORUM / ALL ack policies, lag tracking).
In production today the supported pattern is:
- Schedule `bernstein dr backup --to s3://...` from cron (or your scheduler of choice) every 1 h. Push to a different region than the orchestrator host.
- Encrypt with a passphrase you store in your secrets manager (not in the same region as the orchestrator host).
- Drill restore quarterly into a scratch host. A backup you have not restored is not a backup.
When ENT-010 lands the operational picture changes — followers maintain a continuous WAL stream and the RPO target drops to "lag entries". Until then, treat backup cadence as your RPO floor.
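To give a feel for what the three ack policies and "lag entries" could mean, here is a speculative sketch. The policy names come from the scaffold mentioned above, but their semantics are assumptions: ENT-010 is a draft and this is not the behavior of `wal_replication.py`. In particular, treating QUORUM as a majority of leader plus followers is a guess.

```python
from enum import Enum


class AckPolicy(Enum):
    LEADER_ONLY = "leader_only"
    QUORUM = "quorum"
    ALL = "all"


def required_follower_acks(policy: AckPolicy, followers: int) -> int:
    """Follower acks needed before an entry counts as replicated (assumed)."""
    if followers < 0:
        raise ValueError("followers must be non-negative")
    if policy is AckPolicy.LEADER_ONLY:
        return 0  # the leader's local fsync is enough
    if policy is AckPolicy.ALL:
        return followers
    # QUORUM, assumed to mean a majority of the whole cluster, where the
    # leader's own write counts as one vote.
    cluster_size = followers + 1
    majority = cluster_size // 2 + 1
    return majority - 1


def replication_lag(leader_seq: int, follower_seq: int) -> int:
    """Entries the follower has not yet pulled; never negative."""
    return max(0, leader_seq - follower_seq)
```

Under these assumptions a leader with two followers needs one follower ack for QUORUM and two for ALL, and a follower at sequence 117 behind a leader at 120 has a lag of 3 entries. That lag is the quantity the RPO target would be expressed in once continuous replication ships.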
Code pointers
- `src/bernstein/cli/commands/disaster_recovery_cmd.py`: CLI surface
- `src/bernstein/core/persistence/disaster_recovery.py:1-22`: design rationale + usage
- `src/bernstein/core/persistence/disaster_recovery.py:46-123`: included/excluded paths
- `src/bernstein/core/persistence/disaster_recovery.py:139-174`: Fernet/PBKDF2 crypto
- `src/bernstein/core/persistence/disaster_recovery.py:177-275`: `backup_sdd`
- `src/bernstein/core/persistence/disaster_recovery.py:278-366`: `restore_sdd` (with `filter="data"` traversal guard)
- `src/bernstein/core/persistence/wal.py:1-67`: WAL writer, hash-chain, fsync invariants
- `src/bernstein/core/persistence/wal_replay.py:1-78`: replay pipeline + `IdempotencyStore`
- `src/bernstein/core/persistence/checkpoint.py:1-79`: `Checkpoint` (atomic) + `PartialState` (operator)
- `src/bernstein/core/persistence/wal_replication.py:1-60`: ENT-010 scaffold (Draft)
- `dev/specs/internal-workflows/WORKFLOW-disaster-recovery-cross-region.md`: full ENT-010 design (internal)