Runbook Templates
Operational runbooks for automated remediation of common agent failure patterns.
Overview
Bernstein includes a runbook engine that pattern-matches agent error output against known failure modes and suggests (or automatically executes) remediation actions. When an agent fails with a recognized error, the RunbookEngine finds the first matching rule and returns the suggested fix before the task is retried.
Runbooks are not LLM-driven. They use compiled regex patterns and deterministic matching, so they add near-zero latency to the failure path.
Key concepts:
- RunbookRule -- a detect pattern (regex) paired with an action (shell command or instruction).
- RunbookEngine -- iterates rules in order, returns the first match, and tracks execution history.
- RunbookMatch -- the result of a successful match, with captured groups interpolated into the action string.
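A minimal sketch of how these three concepts could fit together, assuming plain dataclasses and Python's re module. The field names mirror the config fields documented on this page, but the class bodies and the match() method are illustrative and are not Bernstein's actual implementation. Interpolation of captured groups is left out here to keep the sketch short.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookRule:
    """Hypothetical rule shape: a regex paired with a remediation action."""
    name: str
    detect: str
    action: str
    auto_execute: bool = False
    max_retries: int = 2


@dataclass(frozen=True)
class RunbookMatch:
    """Hypothetical match result: the rule that fired plus its captured groups."""
    rule: RunbookRule
    groups: tuple


class RunbookEngine:
    """Illustrative engine: compile once, then scan rules in order."""

    def __init__(self, rules):
        # Compiling up front keeps the failure path to a few regex searches.
        self._compiled = [(rule, re.compile(rule.detect)) for rule in rules]

    def match(self, error_text):
        # First match wins, so earlier rules shadow later ones.
        for rule, pattern in self._compiled:
            found = pattern.search(error_text)
            if found:
                return RunbookMatch(rule=rule, groups=found.groups())
        return None


engine = RunbookEngine([
    RunbookRule("permission_denied", r"PermissionError|Permission denied",
                "Check file permissions on affected paths", max_retries=1),
    RunbookRule("timeout", r"TimeoutError|timed? ?out|deadline exceeded",
                "Retry with increased timeout or reduced scope"),
])

hit = engine.match("PermissionError: [Errno 13] Permission denied: 'out.log'")
print(hit.rule.name if hit else "no match")
```

Because matching is a linear scan over precompiled patterns, the cost per failure grows only with the number of rules.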
Default rules
The built-in rules are loaded automatically when no custom configuration is provided. They cover the most common agent failure patterns:
| Rule name | Detect pattern | Action | Auto-execute | Max retries |
|---|---|---|---|---|
| import_error | ModuleNotFoundError: No module named '(\S+)' | pip install {module} | No | 1 |
| lint_failure | ruff check failed\|Ruff.*error\|ruff.*Found \d+ error | ruff check --fix . | Yes | 2 |
| port_conflict | Address already in use\|EADDRINUSE.*:(\d+)\|port (\d+).*in use | lsof -ti:{port} \| xargs kill -9 | No | 1 |
| type_error | TypeError: .+ got an unexpected keyword argument '(\S+)' | Check function signature for argument {module} | No | 1 |
| permission_denied | PermissionError\|Permission denied | Check file permissions on affected paths | No | 1 |
| git_conflict | CONFLICT \(content\)\|merge conflict\|Merge conflict | Resolve merge conflicts in affected files | No | 1 |
| rate_limit | rate.?limit\|429\|Too Many Requests\|throttl | Wait and retry with exponential backoff | No | 3 |
| disk_space | No space left on device\|ENOSPC\|disk full | Free disk space: clean build artifacts, tmp files | No | 1 |
| timeout | TimeoutError\|timed? ?out\|deadline exceeded | Retry with increased timeout or reduced scope | No | 2 |
| test_failure | FAILED tests/\|pytest.*failed\|AssertionError | Review test output and fix failing assertions | No | 2 |
Rules are evaluated in the order listed. The first match wins.
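To see how ordering affects the result, the sketch below checks a few of the patterns from the table against sample error text using re directly. The patterns are copied from the table; the first_match helper and the sample errors are made up for illustration.

```python
import re

# A subset of the detect patterns from the table, in the same relative order.
PATTERNS = [
    ("import_error", r"ModuleNotFoundError: No module named '(\S+)'"),
    ("rate_limit", r"rate.?limit|429|Too Many Requests|throttl"),
    ("timeout", r"TimeoutError|timed? ?out|deadline exceeded"),
    ("test_failure", r"FAILED tests/|pytest.*failed|AssertionError"),
]


def first_match(error_text):
    """Return (rule_name, captured_groups) for the first pattern that hits."""
    for name, pattern in PATTERNS:
        found = re.search(pattern, error_text)
        if found:
            return name, found.groups()
    return None


# Made-up sample errors for illustration.
print(first_match("ModuleNotFoundError: No module named 'yaml'"))
# This text matches both rate_limit and timeout; rate_limit is listed first.
print(first_match("HTTP 429 after request timed out"))
```

Because the alternatives are unanchored, a bare 429 matches anywhere in the text, including inside unrelated numbers, so both rule order and pattern specificity matter.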
Custom rules via JSON config
Override or extend the default rules by writing a JSON config file and loading it with RunbookEngine.load_rules():
{
  "runbooks": [
    {
      "name": "oom_kill",
      "detect": "Killed|Out of memory|MemoryError",
      "action": "Reduce batch size or increase memory limit",
      "auto_execute": false,
      "max_retries": 1
    },
    {
      "name": "docker_build_fail",
      "detect": "docker build.*failed|Error response from daemon",
      "action": "docker system prune -f && docker build --no-cache .",
      "auto_execute": false,
      "max_retries": 2
    }
  ]
}
Save this as .sdd/config/runbooks.json and the engine picks it up on the next run. If the file is absent or malformed, the engine falls back to the built-in defaults.
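The fallback behaviour can be sketched as a loader that treats any read, parse, or validation problem as "use the defaults". The function name load_rules_or_default, the DEFAULT_RULES stand-in, and the validation steps are assumptions made for this example; they are not the signature or internals of RunbookEngine.load_rules().

```python
import json
import re
from pathlib import Path

# Stand-in for the built-in rules; only one is listed to keep the sketch short.
DEFAULT_RULES = [
    {"name": "lint_failure",
     "detect": r"ruff check failed|Ruff.*error|ruff.*Found \d+ error",
     "action": "ruff check --fix .", "auto_execute": True, "max_retries": 2},
]

REQUIRED_FIELDS = ("name", "detect", "action")


def load_rules_or_default(path):
    """Hypothetical loader: return rules from `path`, or the defaults on any problem."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        rules = data["runbooks"]
        for rule in rules:
            for field in REQUIRED_FIELDS:
                if not isinstance(rule[field], str):
                    raise TypeError(f"{field} must be a string")
            re.compile(rule["detect"])  # reject patterns that do not compile
            # Optional fields take the defaults from the field reference.
            rule.setdefault("auto_execute", False)
            rule.setdefault("max_retries", 2)
        return rules
    except (OSError, ValueError, KeyError, TypeError, re.error):
        # Absent file, invalid JSON, missing keys, or a bad regex: fall back.
        return DEFAULT_RULES


rules = load_rules_or_default(".sdd/config/runbooks.json")
print([rule["name"] for rule in rules])
```

Compiling each detect pattern at load time surfaces a broken regex when the config is read instead of in the middle of a failure.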
Field reference
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | -- | Unique identifier for the rule |
| detect | string | yes | -- | Regex pattern matched against agent error text |
| action | string | yes | -- | Shell command or human-readable instruction |
| auto_execute | bool | no | false | If true, the action runs without confirmation |
| max_retries | int | no | 2 | Maximum automatic retries for this failure |
Capture groups in the detect regex are interpolated into the action string using {module}, {port}, or {file} placeholders.
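How a captured group ends up in the action string can be illustrated with a small helper. This page does not spell out how groups map to placeholders, so the sketch assumes the first non-empty group fills whichever placeholder appears in the action; the interpolate function is hypothetical.

```python
import re

PLACEHOLDERS = ("{module}", "{port}", "{file}")


def interpolate(detect, action, error_text):
    """Hypothetical helper: fill the action's placeholder from the first captured group."""
    found = re.search(detect, error_text)
    if found is None:
        return None
    # Alternations such as the port_conflict pattern leave unused groups as None.
    captured = next((g for g in found.groups() if g is not None), None)
    if captured is None:
        return action  # nothing captured, leave the action untouched
    for placeholder in PLACEHOLDERS:
        action = action.replace(placeholder, captured)
    return action


print(interpolate(
    r"ModuleNotFoundError: No module named '(\S+)'",
    "pip install {module}",
    "ModuleNotFoundError: No module named 'requests'",
))
print(interpolate(
    r"Address already in use|EADDRINUSE.*:(\d+)|port (\d+).*in use",
    "lsof -ti:{port} | xargs kill -9",
    "Error: port 8080 is already in use",
))
```

In this sketch, a match on the "Address already in use" alternative captures nothing, so the {port} placeholder is left as-is. How the real engine handles that case is not covered here.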
Operational scenario templates
Over-budget recovery
An orchestration run is approaching or has exceeded its spending cap.
Indicators:
- Token monitor reports cost > 80% of budget_usd
- Cost anomaly detector fires alerts
Runbook steps:
- Check current spend: GET /status and inspect cost_usd in the response.
- Identify the highest-cost agents from .sdd/metrics/cost_ledger.jsonl.
- Reduce parallelism: lower max_agents in bernstein.yaml to slow burn rate.
- Switch remaining tasks to cheaper models via model_policy overrides.
- If budget is already exhausted, stop the run: bernstein stop.
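Identifying the highest-cost agents can be scripted once the ledger format is known. The ledger schema is not documented on this page, so the sketch below assumes each line is a JSON object with agent_id and cost_usd fields; both field names are hypothetical and should be adjusted to the real file.

```python
import json
from collections import defaultdict


def top_spenders(ledger_lines, limit=3):
    """Sum cost per agent and return the `limit` most expensive ones.

    Assumes each ledger line is a JSON object with hypothetical
    `agent_id` and `cost_usd` fields; adjust to the real schema.
    """
    totals = defaultdict(float)
    for line in ledger_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        totals[entry["agent_id"]] += float(entry["cost_usd"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


def over_threshold(spent_usd, budget_usd, fraction=0.8):
    """True once spend passes the given fraction of the budget (80% by default)."""
    return spent_usd > fraction * budget_usd


# Made-up ledger lines; in practice, read .sdd/metrics/cost_ledger.jsonl.
sample = [
    '{"agent_id": "backend-1", "cost_usd": 1.25}',
    '{"agent_id": "qa-1", "cost_usd": 0.5}',
    '{"agent_id": "backend-1", "cost_usd": 2.0}',
]
ranking = top_spenders(sample)
print(ranking)
print(over_threshold(sum(cost for _, cost in ranking), budget_usd=4.0))
```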
Custom rule example:
{
  "name": "budget_warning",
  "detect": "budget.*exceeded|cost.*limit|spending.*cap",
  "action": "Pause non-critical tasks and reduce max_agents to 2",
  "auto_execute": false,
  "max_retries": 1
}
Agent crash loop
An agent repeatedly fails on the same task, burning retries without progress.
Indicators:
- Task retry count reaches max_retries (default 3)
- Same error pattern appears in consecutive attempts
- Runbook execution stats show repeated matches for the same rule
Runbook steps:
- Check the task trace: bernstein trace <task-id>.
- Look for repeated runbook matches in .sdd/metrics/runbook_log.jsonl.
- If the error is environmental (disk, network, permissions), fix the root cause and retry.
- If the error is in agent-generated code, mark the task as failed and create a simpler subtask.
- Consider escalating to a different model or role.
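Spotting repeated matches in the log can be automated. The sketch below relies only on the rule_name, task_id, and timestamp fields shown in the execution log section; the find_crash_loops helper and its min_repeats threshold are made up for illustration.

```python
import json
from itertools import groupby


def find_crash_loops(log_lines, min_repeats=3):
    """Return {task_id: rule_name} where one rule matched `min_repeats` times in a row.

    Uses the rule_name, task_id, and timestamp fields from the execution
    log; min_repeats is an arbitrary threshold chosen for this sketch.
    """
    entries = [json.loads(line) for line in log_lines if line.strip()]
    entries.sort(key=lambda e: (e["task_id"], e["timestamp"]))
    loops = {}
    for task_id, task_entries in groupby(entries, key=lambda e: e["task_id"]):
        # Within one task, collapse consecutive matches of the same rule.
        for rule_name, run in groupby(task_entries, key=lambda e: e["rule_name"]):
            if len(list(run)) >= min_repeats:
                loops[task_id] = rule_name
                break
    return loops


# Made-up log lines; in practice, read .sdd/metrics/runbook_log.jsonl.
sample = [
    '{"rule_name": "test_failure", "task_id": "task-1", "timestamp": 1.0, "success": false}',
    '{"rule_name": "test_failure", "task_id": "task-1", "timestamp": 2.0, "success": false}',
    '{"rule_name": "lint_failure", "task_id": "task-2", "timestamp": 2.5, "success": true}',
    '{"rule_name": "test_failure", "task_id": "task-1", "timestamp": 3.0, "success": false}',
]
print(find_crash_loops(sample))
```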
Custom rule example:
{
  "name": "crash_loop_breaker",
  "detect": "retry_count.*exceeded|max retries reached",
  "action": "Fail task permanently and notify operator",
  "auto_execute": false,
  "max_retries": 0
}
Merge conflict storm
Multiple agents modify overlapping files, producing a burst of merge conflicts.
Indicators:
- git_conflict runbook rule fires for 3+ tasks within a short window
- Bulletin board shows multiple "blocker" entries referencing merge conflicts
- Janitor merge queue backs up
Runbook steps:
- Pause new task assignments: reduce max_agents to 1 temporarily.
- Drain the merge queue: let the janitor process pending merges one at a time.
- Identify overlapping file scopes by checking owned_files on active tasks.
- Reassign conflicting tasks to run sequentially (add depends_on relationships).
- Resume normal parallelism after the conflict burst clears.
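Finding overlapping file scopes amounts to a pairwise set intersection over owned_files. The sketch below assumes the active tasks are available as a mapping from task id to a list of paths; that input shape and the overlapping_tasks helper are illustrative, not the format Bernstein stores tasks in.

```python
from itertools import combinations


def overlapping_tasks(owned_files_by_task):
    """Return (task_a, task_b, shared_files) for every pair of tasks that share a file.

    The input shape (task id mapped to a list of paths) is an assumption
    made for this sketch.
    """
    conflicts = []
    pairs = combinations(sorted(owned_files_by_task.items()), 2)
    for (task_a, files_a), (task_b, files_b) in pairs:
        shared = sorted(set(files_a) & set(files_b))
        if shared:
            conflicts.append((task_a, task_b, shared))
    return conflicts


# Made-up tasks for illustration.
active = {
    "task-a": ["src/api.py", "src/models.py"],
    "task-b": ["src/models.py", "tests/test_models.py"],
    "task-c": ["docs/index.md"],
}
for task_a, task_b, shared in overlapping_tasks(active):
    # Each pair is a candidate for a depends_on relationship.
    print(f"{task_b} should wait for {task_a}: both own {shared}")
```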
Custom rule example:
{
  "name": "conflict_storm",
  "detect": "CONFLICT.*CONFLICT.*CONFLICT|multiple merge conflicts",
  "action": "Reduce max_agents to 1 and drain merge queue before resuming",
  "auto_execute": false,
  "max_retries": 1
}
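One thing to check with a pattern like this: in Python's re module, "." does not match a newline unless the pattern is compiled with re.DOTALL, and git usually prints each CONFLICT on its own line. This page does not say which flags the engine compiles with, so the sketch below only shows the difference in plain re. The sample git output is made up, and PORTABLE is a suggested variant, not part of the rule above.

```python
import re

DETECT = r"CONFLICT.*CONFLICT.*CONFLICT|multiple merge conflicts"

# Made-up git output with one conflict per line.
git_output = (
    "CONFLICT (content): Merge conflict in src/api.py\n"
    "CONFLICT (content): Merge conflict in src/models.py\n"
    "CONFLICT (content): Merge conflict in src/cli.py\n"
)

# Default flags: "." stops at newlines, so the three markers are never joined.
print(bool(re.search(DETECT, git_output)))

# With DOTALL, "." also matches newlines and the pattern spans all three lines.
print(bool(re.search(DETECT, git_output, re.DOTALL)))

# A flag-independent alternative: [\s\S] matches any character, newlines included.
PORTABLE = r"CONFLICT[\s\S]*CONFLICT[\s\S]*CONFLICT|multiple merge conflicts"
print(bool(re.search(PORTABLE, git_output)))
```

Testing a custom detect pattern against real agent output before deploying it avoids a rule that silently never fires.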
High failure rate
The overall task failure rate climbs above an acceptable threshold.
Indicators:
- Dashboard (GET /status) shows failure rate > 30%
- Multiple different runbook rules are firing across tasks
- Agent utilization drops as tasks pile up in failed state
Runbook steps:
- Pull failure summary: bernstein status or GET /status.
- Group failures by error pattern using .sdd/metrics/runbook_log.jsonl.
- If failures cluster around a single cause (e.g., a broken dependency), fix that cause first.
- If failures are diverse, check for environmental issues: disk space, network, API rate limits.
- Consider stopping the run, fixing the environment, and restarting: bernstein stop && bernstein run.
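Grouping by error pattern and deciding whether failures cluster can be sketched as a count per rule_name followed by a share check. The sketch assumes each logged execution corresponds to one agent failure that matched a rule. The helper names and the 50% cut-off are arbitrary choices for illustration.

```python
import json
from collections import Counter


def matches_by_rule(log_lines):
    """Count log entries per rule_name (assumed: one entry per matched failure)."""
    counts = Counter()
    for line in log_lines:
        if not line.strip():
            continue
        counts[json.loads(line)["rule_name"]] += 1
    return counts


def dominant_cause(counts, share=0.5):
    """Return the rule behind more than `share` of entries, or None if diverse.

    The 50% cut-off is an arbitrary choice for this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    rule_name, count = counts.most_common(1)[0]
    return rule_name if count / total > share else None


# Made-up log lines; in practice, read .sdd/metrics/runbook_log.jsonl.
sample = [
    '{"rule_name": "import_error", "task_id": "task-1", "success": false}',
    '{"rule_name": "import_error", "task_id": "task-2", "success": false}',
    '{"rule_name": "timeout", "task_id": "task-3", "success": false}',
    '{"rule_name": "import_error", "task_id": "task-4", "success": true}',
]
counts = matches_by_rule(sample)
print(counts.most_common())
print(dominant_cause(counts))
```

A single dominant rule points at one root cause to fix first; a None result suggests checking the environment instead.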
Custom rule example:
{
  "name": "high_failure_rate",
  "detect": "failure rate.*above threshold|too many failures",
  "action": "Stop run, audit .sdd/metrics, fix root cause, then restart",
  "auto_execute": false,
  "max_retries": 1
}
Execution log
All runbook executions are persisted to .sdd/metrics/runbook_log.jsonl as newline-delimited JSON. Each entry contains:
{
  "rule_name": "lint_failure",
  "task_id": "task-abc123",
  "action": "ruff check --fix .",
  "timestamp": 1712345678.9,
  "success": true,
  "output": "Fixed 3 errors."
}
Use RunbookEngine.get_stats() to retrieve aggregated execution statistics (total executions, successes, and failures grouped by rule name).
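For ad hoc analysis outside the engine, the same aggregation can be approximated directly from the log file. The sketch below is a hypothetical stand-in, not the implementation or return shape of get_stats(); it uses only the rule_name and success fields from the entry format above.

```python
import json
from collections import defaultdict


def aggregate_stats(log_lines):
    """Hypothetical stand-in for get_stats(): totals plus per-rule success/failure counts."""
    by_rule = defaultdict(lambda: {"successes": 0, "failures": 0})
    for line in log_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        bucket = "successes" if entry["success"] else "failures"
        by_rule[entry["rule_name"]][bucket] += 1
    successes = sum(counts["successes"] for counts in by_rule.values())
    failures = sum(counts["failures"] for counts in by_rule.values())
    return {
        "total": successes + failures,
        "successes": successes,
        "failures": failures,
        "by_rule": dict(by_rule),
    }


# The first line is the entry shown above; the other two are made up.
sample = [
    '{"rule_name": "lint_failure", "task_id": "task-abc123", "action": "ruff check --fix .", '
    '"timestamp": 1712345678.9, "success": true, "output": "Fixed 3 errors."}',
    '{"rule_name": "lint_failure", "task_id": "task-2", "success": false}',
    '{"rule_name": "timeout", "task_id": "task-3", "success": false}',
]
stats = aggregate_stats(sample)
print(json.dumps(stats, indent=2))
```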