Model Routing & Escalation¶
Does Bernstein actually escalate failed tasks to a more capable model?
Yes. Bernstein ships two distinct cascade systems, both 100% implemented and wired into the orchestrator:
System 1 — CascadeRouter | System 2 — CascadeFallbackManager | |
|---|---|---|
| Scope | Intra-Claude tier escalation | Cross-adapter failover |
| Trigger | Task failure, janitor failure, low-confidence output | Rate limit, timeout, provider error |
| Walks | sonnet → opus (and historical haiku → sonnet → opus) | opus → sonnet → codex → gemini → qwen |
| Source | core/routing/cascade_router.py | core/routing/cascade.py |
This document explains both, the configuration knobs, and the observability surface. For the higher-level provider-policy controls (allow/deny lists, preferred provider) see operations/MODEL_POLICY.md.
System 1 — Intra-Claude tier escalation (CascadeRouter)¶
CascadeRouter (src/bernstein/core/routing/cascade_router.py:246) selects the cheapest viable Claude tier for a task's first attempt and escalates only when post-hoc signals warrant it.
When it runs¶
The router is consulted when the chosen adapter is Claude-compatible (router_applicable() line 294 — checks the adapter name against the {claude, claude code, claude_code, claude-code} set). For every other adapter the router is skipped — its arms (sonnet, opus) are Claude-specific.
The tier ladder¶
CASCADE is defined in src/bernstein/core/cost/cost.py:119:
_cascade_for_task() (cascade_router.py:681-700) returns the same list for both standard and high-stakes paths today. Historical builds (and the docstring at cascade_router.py:14) included haiku → sonnet → opus for standard tasks and sonnet → opus for high-stakes; the haiku tier was dropped after observing that on the Anthropic Max plan sonnet is unlimited and produces measurably better results (cost.py:117-119).
A task is treated as high-stakes when any of:
task.role∈{"manager", "architect", "security"}task.complexity == Complexity.HIGHtask.scope == Scope.LARGEtask.priority == 1(highest)
(See cascade_router.py:579-583 and _cascade_for_task lines 693-699.)
Initial selection¶
select(task) (cascade_router.py:330) returns a CascadeDecision:
@dataclass
class CascadeDecision:
model: str # e.g. "sonnet"
effort: str # "low" | "high" | "max"
attempt_number: int # 0 for first attempt
is_escalated: bool
reason: str # human-readable explanation
estimated_cost_usd: float
chain_id: str # opaque id; pass back to record_and_escalate()
_select_initial_model() (cascade_router.py:570-618) decides:
- High-stakes role/complexity/scope/priority → start at
cascade[0](currentlysonnet). - Manager-supplied
task.modeloverride → use it if it appears in the cascade. - Otherwise consult the bandit (next section). If the cheapest tier has been observed
≥ MIN_OBSERVATIONStimes and its observedsuccess_rate < QUALITY_THRESHOLD, skip it and try the next tier — a proactive cost-then-quality move.
Escalation triggers¶
After the agent finishes, record_and_escalate() (cascade_router.py:386) records the attempt and decides whether to retry on a higher tier. Triggers, evaluated in order by _should_escalate() (lines 639-673):
- Hard task failure.
attempt.success == Falsewith no other info → escalate. (Line 655-657.) - Janitor verification failure.
janitor_passed is False→ escalate. (Line 660-661.) - Low-confidence output. A regex over the last 2 000 chars of agent output (
_LOW_CONFIDENCE_PATTERNat lines 67-87) matches phrases like "I'm not sure", "I cannot determine", "partial implementation", "left as placeholder", "TODO: escalat...".detect_low_confidence()at line 371. → escalate. - Late explicit failure.
attempt.success == Falseeven after janitor or output checks → escalate.
If escalation fires and the chain is not already at the top of the cascade (current_idx < len(cascade) - 1, line 448-455), the router returns a new CascadeDecision for the next tier. Otherwise the chain ends and the failure stands.
The bandit¶
EpsilonGreedyBandit lives in src/bernstein/core/cost/cost.py. It keeps one BanditArm per (role, model) pair, recording observations (success/failure, cost, latency) over the run. Constants:
EPSILON = 0.1 # 10% explore, 90% exploit
MIN_OBSERVATIONS = 5 # arms trusted only after this many samples
QUALITY_THRESHOLD = 0.80 # min success_rate to consider an arm
(cost.py:40-42.)
record_and_escalate() calls _record_bandit() (line 559) after every attempt, so the bandit's view of each arm improves with use. The proactive-skip rule in _select_initial_model() reads arm.success_rate and arm.observations directly (lines 594-614).
A new arm starts with a pessimistic success_rate = 0.5 (cost.py:139) so a freshly added cheap model cannot greedily win selection on its first observation — it has to earn the trust through real successes.
The bandit's persisted state is loaded lazily via EpsilonGreedyBandit.load(metrics_dir) (cascade_router.py:553-554) and saved on demand via save_bandit() (line 541-544).
Persistence — cascade_chains.jsonl¶
save_chain(chain_id, task, metrics_dir) (cascade_router.py:518-539) appends one JSON line per completed chain to:
Sample line (whitespace added for readability):
{
"timestamp": 1714824312.5,
"chain_id": "a1b2c3d4e5f6789a",
"task_id": "T-042",
"role": "backend",
"attempts": [
{"model": "sonnet", "effort": "high", "attempt_number": 0,
"cost_usd": 0.042, "latency_s": 87.3, "success": false,
"escalated": true,
"escalation_reason": "low-confidence signal in output: 'partial implementation'"},
{"model": "opus", "effort": "max", "attempt_number": 1,
"cost_usd": 0.612, "latency_s": 156.0, "success": true,
"escalated": false}
],
"final_model": "opus",
"succeeded": true,
"total_cost_usd": 0.654,
"first_attempt_cost_usd": 0.042,
"escalation_overhead_usd": 0.612,
"saved_vs_direct_opus_usd": 0.546
}
Aggregate stats are computed on demand by load_cascade_savings_summary(metrics_dir) (cascade_router.py:725-):
{
"total_chains": 1247,
"total_cost_usd": 412.93,
"escalation_overhead_usd": 28.41,
"saved_vs_opus_usd": 1153.04,
"escalation_rate": 0.073 # 7.3% of chains escalated past tier 0
}
System 2 — Cross-adapter failover (CascadeFallbackManager)¶
CascadeFallbackManager (src/bernstein/core/routing/cascade.py:108-) handles a different problem: the chosen provider is unavailable (rate limit, timeout, API error) and we need to redirect the task to a different provider entirely.
Default cascade order¶
DEFAULT_CASCADE_ORDER (cascade.py:61):
Each entry resolves to a provider via _MODEL_TO_PROVIDER (cascade.py:64-80). When find_fallback() (line 287) is called, it walks this list starting after the current entry and returns the first viable agent.
When it runs¶
Called by the orchestrator/spawner when an agent process raises rate-limit (HTTP 429), timeout, or generic API error (cascade.py:11-13 lists the triggers). The trigger argument records which condition fired so metrics can split by cause.
Capability floor¶
Cross-adapter fallback never violates the task's reasoning floor — fallback to a weak model just because the strong one is throttled would silently degrade output quality:
CAPABILITY_FLOOR: dict[Complexity, int] = {
Complexity.HIGH: _STRENGTH_ORDER["high"], # only high or very_high
Complexity.MEDIUM: _STRENGTH_ORDER["medium"],
Complexity.LOW: _STRENGTH_ORDER["low"],
}
(cascade.py:44-48.) Candidates below the floor are skipped (_is_viable_candidate at line 349-380).
Sticky fallback¶
To prevent ping-pong between primary and fallback, once a fallback is selected it sticks for _DEFAULT_STICKY_DURATION_S = 300.0 seconds (cascade.py:83). find_fallback() checks get_sticky_fallback() first (lines 311-331) and reuses it when the window has not yet expired.
Cascade chain inspection¶
find_fallback_chain(complexity, initial_provider) (cascade.py:503-526) walks the entire cascade and returns the full sequence of fallback decisions that would be tried if each provider in turn were rate-limited. Useful for audit logs and debugging.
Configuration¶
YAML (bernstein.yaml)¶
quality_gates:
enabled: true # cascade router gates on janitor + lint + tests
lint: true
# Per-role default model and effort. CascadeRouter reads task.model /
# task.effort from this when the manager has not set them.
role_model_policy:
manager: { cli: claude, model: opus, effort: max }
backend: { cli: claude, model: sonnet, effort: high }
qa: { cli: claude, model: sonnet, effort: high }
# Provider-level gating runs *before* either cascade.
# See operations/MODEL_POLICY.md.
model_policy:
allowed_providers: [anthropic, openai, google]
prefer: anthropic
CLI flags (src/bernstein/cli/main.py:482)¶
| Flag | Effect |
|---|---|
--routing {default,bandit} | Pick the bandit or the static-cost router. |
--model <name> | Override the initial model for this run. |
--budget <usd> | Cap per-run spend; cascade refuses to escalate past it. |
Observability¶
cascade_chains.jsonl¶
Schema documented above. The simplest way to inspect:
jq '. | {role: .role, attempts: (.attempts | length), saved: .saved_vs_direct_opus_usd}' \
.sdd/metrics/cascade_chains.jsonl | tail
/routing/bandit HTTP endpoint¶
Exposed by core/routes/status_dashboard.py:905-920:
Returns per-(role, model) bandit state read from .sdd/routing/ (populated when --routing bandit is active). Sample shape:
{
"active": true,
"mode": "bandit",
"arms": {
"backend|sonnet": {"observations": 124, "success_rate": 0.91,
"avg_cost_usd": 0.039, "avg_latency_s": 78.4},
"backend|opus": {"observations": 12, "success_rate": 1.00,
"avg_cost_usd": 0.61, "avg_latency_s": 142.7}
}
}
Empty {} when bandit routing has not been activated.
Logs¶
The router logs every escalation at INFO:
Cascade chain a1b2c3d4...: escalating sonnet → opus
(reason: low-confidence signal in output: 'partial implementation')
(cascade_router.py:462-468.) Cross-adapter fallback logs at INFO when a provider is selected and at WARNING when the chain is exhausted (cascade.py:436-440).
Aggregate savings summary¶
from pathlib import Path
from bernstein.core.routing.cascade_router import load_cascade_savings_summary
summary = load_cascade_savings_summary(Path(".sdd/metrics"))
print(summary["saved_vs_opus_usd"], summary["escalation_rate"])
How the two systems interact¶
A single task's run can touch both:
CascadeRouter.select(task) → sonnet (intra-Claude)
adapter.run(...) → 429 from Anthropic
CascadeFallbackManager.find_fallback( → (codex, gpt-5.4)
excluded={anthropic}, current_entry="sonnet", trigger="rate_limit")
adapter[codex].run(...) → success but janitor fails
CascadeRouter.record_and_escalate(janitor_passed=False)
→ opus (back inside Claude)
The two routers do not coordinate state. CascadeRouter cares about output quality; CascadeFallbackManager cares about provider availability.
Code pointers¶
| Concern | File | Symbol / line |
|---|---|---|
| Cascade router selection | src/bernstein/core/routing/cascade_router.py | CascadeRouter.select:330 |
| Initial model picker | src/bernstein/core/routing/cascade_router.py | _select_initial_model:570-618 |
| Escalation decision | src/bernstein/core/routing/cascade_router.py | record_and_escalate:386, _should_escalate:639-673 |
| Low-confidence regex | src/bernstein/core/routing/cascade_router.py | _LOW_CONFIDENCE_PATTERN:67-87, detect_low_confidence:371 |
| Tier ladder | src/bernstein/core/cost/cost.py | CASCADE:119 |
_cascade_for_task (high-stakes branch) | src/bernstein/core/routing/cascade_router.py | _cascade_for_task:681-700 |
| Bandit arm | src/bernstein/core/cost/cost.py | EpsilonGreedyBandit, constants EPSILON/MIN_OBSERVATIONS/QUALITY_THRESHOLD:40-42 |
| Proactive-skip rule | src/bernstein/core/routing/cascade_router.py | _select_initial_model:594-614 |
| Chain persistence | src/bernstein/core/routing/cascade_router.py | save_chain:518, CHAIN_FILE:288 |
| Aggregate savings | src/bernstein/core/routing/cascade_router.py | load_cascade_savings_summary:725 |
| Cross-adapter fallback | src/bernstein/core/routing/cascade.py | CascadeFallbackManager, find_fallback:287-347 |
| Default cascade order | src/bernstein/core/routing/cascade.py | DEFAULT_CASCADE_ORDER:61 |
| Capability floor | src/bernstein/core/routing/cascade.py | CAPABILITY_FLOOR:44-48, _is_viable_candidate:349-380 |
| Sticky fallback | src/bernstein/core/routing/cascade.py | _DEFAULT_STICKY_DURATION_S:83, get_sticky_fallback:311-331 |
/routing/bandit endpoint | src/bernstein/core/routes/status_dashboard.py | bandit_routing_stats:905 |
--routing CLI flag | src/bernstein/cli/main.py | :482 |
Related¶
operations/MODEL_POLICY.md— provider-level allow/deny policy that runs before either cascade decides anything.architecture/quality-pipeline.md— janitor + gates that emit thejanitor_passedsignal the cascade router consumes.architecture/state-persistence.md— where bandit state and cascade chain reports live on disk.