Runbook Templates¶

Operational runbooks for automated remediation of common agent failure patterns.

Overview¶

Bernstein includes a runbook engine that pattern-matches agent error output against known failure modes and suggests (or automatically executes) remediation actions. When an agent fails with a recognized error, the RunbookEngine finds the first matching rule and returns the suggested fix before the task is retried.

Runbooks are not LLM-driven. They use compiled regex patterns and deterministic matching, so they add near-zero latency to the failure path.

Key concepts:

RunbookRule -- a detect pattern (regex) paired with an action (shell command or instruction).
RunbookEngine -- iterates rules in order, returns the first match, and tracks execution history.
RunbookMatch -- the result of a successful match, with captured groups interpolated into the action string.

Default rules¶

The built-in rules are loaded automatically when no custom configuration is provided. They cover the most common agent failure patterns:

Rule name	Detect pattern	Action	Auto-execute	Max retries
`import_error`	`ModuleNotFoundError: No module named '(\S+)'`	`pip install {module}`	No	1
`lint_failure`	`ruff check failed\\|Ruff.error\\|ruff.Found \d+ error`	`ruff check --fix .`	Yes	2
`port_conflict`	`Address already in use\\|EADDRINUSE.:(\d+)\\|port (\d+).in use`	`lsof -ti:{port} \\| xargs kill -9`	No	1
`type_error`	`TypeError: .+ got an unexpected keyword argument '(\S+)'`	Check function signature for argument `{module}`	No	1
`permission_denied`	`PermissionError\\|Permission denied`	Check file permissions on affected paths	No	1
`git_conflict`	`CONFLICT \(content\)\\|merge conflict\\|Merge conflict`	Resolve merge conflicts in affected files	No	1
`rate_limit`	`rate.?limit\\|429\\|Too Many Requests\\|throttl`	Wait and retry with exponential backoff	No	3
`disk_space`	`No space left on device\\|ENOSPC\\|disk full`	Free disk space: clean build artifacts, tmp files	No	1
`timeout`	`TimeoutError\\|timed? ?out\\|deadline exceeded`	Retry with increased timeout or reduced scope	No	2
`test_failure`	`FAILED tests/\\|pytest.*failed\\|AssertionError`	Review test output and fix failing assertions	No	2

Rules are evaluated in the order listed. The first match wins.

Custom rules via JSON config¶

Override or extend the default rules by placing a JSON config file and loading it with RunbookEngine.load_rules():

{
  "runbooks": [
    {
      "name": "oom_kill",
      "detect": "Killed|Out of memory|MemoryError",
      "action": "Reduce batch size or increase memory limit",
      "auto_execute": false,
      "max_retries": 1
    },
    {
      "name": "docker_build_fail",
      "detect": "docker build.*failed|Error response from daemon",
      "action": "docker system prune -f && docker build --no-cache .",
      "auto_execute": false,
      "max_retries": 2
    }
  ]
}

Save this as .sdd/config/runbooks.json and the engine picks it up on next run. If the file is absent or malformed, the engine falls back to the built-in defaults.

Field reference¶

Field	Type	Required	Default	Description
`name`	string	yes	--	Unique identifier for the rule
`detect`	string	yes	--	Regex pattern matched against agent error text
`action`	string	yes	--	Shell command or human-readable instruction
`auto_execute`	bool	no	`false`	If true, the action runs without confirmation
`max_retries`	int	no	`2`	Maximum automatic retries for this failure

Capture groups in the detect regex are interpolated into the action string using {module}, {port}, or {file} placeholders.

Operational scenario templates¶

Over-budget recovery¶

An orchestration run is approaching or has exceeded its spending cap.

Indicators: - Token monitor reports cost > 80% of budget_usd - Cost anomaly detector fires alerts

Runbook steps:

Check current spend: GET /status and inspect cost_usd in the response.
Identify the highest-cost agents from .sdd/cost/ledger.jsonl.
Reduce parallelism: lower max_agents in bernstein.yaml to slow burn rate.
Switch remaining tasks to cheaper models via model_policy overrides.
If budget is already exhausted, stop the run: bernstein stop.

Custom rule example:

{
  "name": "budget_warning",
  "detect": "budget.*exceeded|cost.*limit|spending.*cap",
  "action": "Pause non-critical tasks and reduce max_agents to 2",
  "auto_execute": false,
  "max_retries": 1
}

Agent crash loop¶

An agent repeatedly fails on the same task, burning retries without progress.

Indicators: - Task retry count reaches max_retries (default 3) - Same error pattern appears in consecutive attempts - Runbook execution stats show repeated matches for the same rule

Runbook steps:

Check the task trace: bernstein trace <task-id>.
Look for repeated runbook matches in .sdd/metrics/runbook_log.jsonl.
If the error is environmental (disk, network, permissions), fix the root cause and retry.
If the error is in agent-generated code, mark the task as failed and create a simpler subtask.
Consider escalating to a different model or role.

Custom rule example:

{
  "name": "crash_loop_breaker",
  "detect": "retry_count.*exceeded|max retries reached",
  "action": "Fail task permanently and notify operator",
  "auto_execute": false,
  "max_retries": 0
}

Merge conflict storm¶

Multiple agents modify overlapping files, producing a burst of merge conflicts.

Indicators: - git_conflict runbook rule fires for 3+ tasks within a short window - Bulletin board shows multiple "blocker" entries referencing merge conflicts - Janitor merge queue backs up

Runbook steps:

Pause new task assignments: reduce max_agents to 1 temporarily.
Drain the merge queue: let the janitor process pending merges one at a time.
Identify overlapping file scopes by checking owned_files on active tasks.
Reassign conflicting tasks to run sequentially (add depends_on relationships).
Resume normal parallelism after the conflict burst clears.

Custom rule example:

{
  "name": "conflict_storm",
  "detect": "CONFLICT.*CONFLICT.*CONFLICT|multiple merge conflicts",
  "action": "Reduce max_agents to 1 and drain merge queue before resuming",
  "auto_execute": false,
  "max_retries": 1
}

High failure rate¶

The overall task failure rate climbs above an acceptable threshold.

Indicators: - Dashboard (GET /status) shows failure rate > 30% - Multiple different runbook rules are firing across tasks - Agent utilization drops as tasks pile up in failed state

Runbook steps:

Pull failure summary: bernstein status or GET /status.
Group failures by error pattern using .sdd/metrics/runbook_log.jsonl.
If failures cluster around a single cause (e.g., a broken dependency), fix that cause first.
If failures are diverse, check for environmental issues: disk space, network, API rate limits.
Consider stopping the run, fixing the environment, and restarting: bernstein stop && bernstein run.

Custom rule example:

{
  "name": "high_failure_rate",
  "detect": "failure rate.*above threshold|too many failures",
  "action": "Stop run, audit .sdd/metrics, fix root cause, then restart",
  "auto_execute": false,
  "max_retries": 1
}

Execution log¶

All runbook executions are persisted to .sdd/metrics/runbook_log.jsonl as newline-delimited JSON. Each entry contains:

{
  "rule_name": "lint_failure",
  "task_id": "task-abc123",
  "action": "ruff check --fix .",
  "timestamp": 1712345678.9,
  "success": true,
  "output": "Fixed 3 errors."
}

Use RunbookEngine.get_stats() to retrieve aggregated execution statistics (total executions, successes, and failures grouped by rule name).