Cost Optimization Guide
Strategies for reducing API spend and compute costs when running Bernstein.
Model selection
Model cost is typically the dominant expense. Bernstein selects models per-task based on complexity.
Model tiers
| Tier | Use case | Relative cost |
|---|---|---|
| Haiku / GPT-4o-mini | Simple refactors, formatting, test fixes | 1x |
| Sonnet / GPT-4o | Standard features, bug fixes, moderate complexity | 5-10x |
| Opus / o1 | Architecture decisions, complex multi-file changes | 20-50x |
Configuring model routing
The orchestrator maps task complexity to models automatically. Default mapping (from src/bernstein/core/defaults.py, PLAN.model_by_complexity):
| Complexity | Default model |
|---|---|
| low | haiku |
| medium | sonnet |
| high | opus |
Override per-task in plan files:
steps:
  - title: "Fix typo in README"
    complexity: low  # Routes to haiku
  - title: "Redesign auth system"
    complexity: high  # Routes to opus
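The routing described above can be pictured as a dictionary lookup. The following is a minimal sketch, not the orchestrator's actual code: `DEFAULT_MODEL_BY_COMPLEXITY`, `resolve_model`, and the `medium` fallback for steps without a `complexity` field are all hypothetical names and assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical mirror of the default mapping in the table above; the real
# value lives in PLAN.model_by_complexity and may have a different shape.
DEFAULT_MODEL_BY_COMPLEXITY = {"low": "haiku", "medium": "sonnet", "high": "opus"}


def resolve_model(step: dict, overrides: Optional[dict] = None) -> str:
    """Pick a model for a plan step based on its complexity field."""
    mapping = {**DEFAULT_MODEL_BY_COMPLEXITY, **(overrides or {})}
    # Assumption: steps without a complexity are treated as medium.
    complexity = step.get("complexity", "medium")
    if complexity not in mapping:
        raise ValueError(f"unknown complexity: {complexity!r}")
    return mapping[complexity]


steps = [
    {"title": "Fix typo in README", "complexity": "low"},
    {"title": "Redesign auth system", "complexity": "high"},
]
routed = [resolve_model(step) for step in steps]
print(routed)
```

Passing an `overrides` dictionary (for example `{"high": "sonnet"}`) shows how a cheaper mapping would change routing without touching the plan file.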
Cost impact of model choice
For a typical 100-task project:
- All opus: ~$50-150
- Mixed (default policy): ~$15-40
- All haiku: ~$3-8
The mixed approach usually delivers the best quality-to-cost ratio.
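To see why a mixed policy lands between the two extremes, the relative costs from the tier table can be blended by task share. This is a rough sketch: the 60/30/10 split in `TASK_MIX` is a hypothetical mix, not a measured Bernstein default, and the multipliers are the coarse ranges from the tier table.

```python
import math

# Relative cost multipliers from the tier table, as (low, high) bounds.
TIER_COST = {"haiku": (1, 1), "sonnet": (5, 10), "opus": (20, 50)}

# Hypothetical share of tasks routed to each tier; real projects will differ.
TASK_MIX = {"haiku": 0.6, "sonnet": 0.3, "opus": 0.1}


def blended_multiplier(mix: dict, tier_cost: dict) -> tuple:
    """Return (low, high) cost of the mix relative to an all-haiku run."""
    if not math.isclose(sum(mix.values()), 1.0):
        raise ValueError("task shares must sum to 1")
    low = sum(share * tier_cost[model][0] for model, share in mix.items())
    high = sum(share * tier_cost[model][1] for model, share in mix.items())
    return low, high


low, high = blended_multiplier(TASK_MIX, TIER_COST)
print(f"mixed policy: {low:.1f}x to {high:.1f}x the all-haiku cost")
```

Under these assumptions the mix costs a single-digit multiple of an all-haiku run, compared with 20-50x for an all-opus run.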
Prompt caching
Bernstein supports provider prompt caching to reduce token costs on repeated context.
Caching is most effective when:
- Multiple agents share the same role (e.g., several backend agents)
- Project context files are stable between tasks
- The same codebase analysis is reused across tasks
Expected savings: 30-60% on input tokens for cached portions.
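Provider prompt caches generally match on an identical prompt prefix, which is why shared roles and stable context files matter. The sketch below illustrates the idea of keeping the stable part first and the task-specific part last. `build_prompt` and `prefix_key` are hypothetical helpers for illustration and say nothing about how Bernstein assembles prompts internally.

```python
import hashlib


def build_prompt(role_prompt: str, context_files: dict, task: str) -> tuple:
    """Split a prompt into a cacheable stable prefix and a per-task suffix."""
    # Sort by path so the prefix is byte-identical regardless of dict order.
    context = "\n".join(
        f"## {path}\n{body}" for path, body in sorted(context_files.items())
    )
    stable_prefix = f"{role_prompt}\n{context}\n"
    return stable_prefix, task


def prefix_key(prefix: str) -> str:
    """Fingerprint of the prefix; equal keys mean the prefix can be reused."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


role = "You are a backend agent."
prefix_1, task_1 = build_prompt(role, {"b.py": "B", "a.py": "A"}, "Fix bug 1")
prefix_2, task_2 = build_prompt(role, {"a.py": "A", "b.py": "B"}, "Fix bug 2")
print(prefix_key(prefix_1) == prefix_key(prefix_2))
```

Two tasks with the same role and unchanged files produce the same prefix, so only the short task suffix differs between requests. Editing any context file changes the prefix and invalidates that reuse.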
Batch API usage
For non-urgent workloads, use batch API pricing (typically 50% cheaper).
Batch mode collects tasks and submits them together. Latency increases but cost decreases. Appropriate for overnight runs, CI/CD, and non-interactive workflows.
Token budget management
Set per-task and per-run token budgets to prevent runaway costs:
budget:
  per_task_max_tokens: 100000
  per_run_max_tokens: 2000000
  per_run_max_cost_usd: 50.00
  alert_threshold_pct: 80
When a task exceeds its token budget, it is terminated gracefully with partial results saved. The orchestrator can retry with a cheaper model.
Cheaper retry strategy
When a task fails or exceeds budget on an expensive model, retry it with a cheaper model. Many tasks succeed with less capable models on the second attempt because the first attempt's partial work narrows the problem.
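One way to picture this strategy is a fallback chain that hands the previous attempt's partial output to the next, cheaper model. The sketch below is an assumption-laden illustration: `FALLBACK_CHAIN`, `run_with_cheaper_retry`, and the `run(task, model, partial)` callback signature are hypothetical and not part of Bernstein's API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical ordering, most to least expensive.
FALLBACK_CHAIN = ["opus", "sonnet", "haiku"]


def run_with_cheaper_retry(
    task: str,
    start_model: str,
    run: Callable[[str, str, Optional[str]], Tuple[bool, str]],
) -> Tuple[str, str]:
    """Try start_model first, then each cheaper model in turn.

    run(task, model, partial) returns (succeeded, output); on failure the
    output is treated as partial work and passed to the next attempt.
    """
    start = FALLBACK_CHAIN.index(start_model)
    partial: Optional[str] = None
    for model in FALLBACK_CHAIN[start:]:
        succeeded, output = run(task, model, partial)
        if succeeded:
            return model, output
        partial = output  # carry partial results into the cheaper retry
    raise RuntimeError(f"task failed on every model from {start_model} down")


def fake_run(task: str, model: str, partial: Optional[str]) -> Tuple[bool, str]:
    # Stand-in executor: the expensive model runs out of budget.
    if model == "opus":
        return False, "partial diff"
    return True, f"completed using {partial!r} as a starting point"


print(run_with_cheaper_retry("Redesign auth system", "opus", fake_run))
```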
Context window optimization
Scope and file isolation
Use scope (small, medium, large) and files to limit context per task:
steps:
  - title: "Add validation to User model"
    scope: small
    files:
      - "src/models/user.py"
      - "tests/test_user.py"
Narrow scopes and explicit file lists mean:
- Fewer files loaded into context
- Fewer tokens consumed
- Lower cost per task
- Better conflict detection (via files field)
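The conflict-detection benefit follows from explicit file lists being comparable: two steps that name the same file are candidates for a conflict if run in parallel. A minimal sketch of that check, assuming steps are plain dictionaries with `title` and `files` keys as in the plan example (`find_file_conflicts` is a hypothetical helper, not Bernstein's implementation):

```python
from itertools import combinations


def find_file_conflicts(steps: list) -> list:
    """Return (title_a, title_b, shared_files) for every overlapping pair."""
    conflicts = []
    for a, b in combinations(steps, 2):
        shared = set(a.get("files", [])) & set(b.get("files", []))
        if shared:
            conflicts.append((a["title"], b["title"], sorted(shared)))
    return conflicts


plan = [
    {"title": "Add validation to User model",
     "files": ["src/models/user.py", "tests/test_user.py"]},
    {"title": "Add email field to User model",
     "files": ["src/models/user.py"]},
    {"title": "Fix typo in README", "files": ["README.md"]},
]
print(find_file_conflicts(plan))
```

Steps without a `files` list cannot be checked this way, which is one more reason to list files explicitly.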
Context compression
Enable context compression to reduce token usage for large codebases.
Spend visibility
Bernstein tracks token usage and provides cost breakdowns via bernstein cost:
# View current run cost
bernstein cost
# View cost breakdown by model
bernstein cost --by model
# View cost breakdown by agent
bernstein cost --by agent
# View cost for last 24 hours
bernstein cost --last 24h
# JSON output for scripting
bernstein cost --json
Cost forecasting is handled by src/bernstein/core/cost/spend_forecast.py and src/bernstein/core/cost/cost_forecast.py.
Peak-hour routing
Route non-urgent tasks to off-peak windows to reduce costs:
# Configured in bernstein.yaml or via src/bernstein/core/cost/peak_hour_router.py
peak_hour_routing:
  enabled: true
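The deferral logic behind peak-hour routing can be sketched as follows. The peak window (weekdays, 09:00 to 17:00 in the timestamp's own timezone) and the function names are assumptions made for illustration; the actual window and behavior are defined by `peak_hour_router.py` and may differ.

```python
from datetime import datetime, time, timezone

# Assumed peak window; not taken from Bernstein's configuration.
PEAK_START = time(9, 0)
PEAK_END = time(17, 0)


def is_peak(now: datetime) -> bool:
    """True on weekdays between PEAK_START (inclusive) and PEAK_END."""
    return now.weekday() < 5 and PEAK_START <= now.time() < PEAK_END


def next_run_time(now: datetime, urgent: bool) -> datetime:
    """Urgent tasks run immediately; others wait until the peak window ends."""
    if urgent or not is_peak(now):
        return now
    return datetime.combine(now.date(), PEAK_END, tzinfo=now.tzinfo)


monday_morning = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
print(next_run_time(monday_morning, urgent=False))
print(next_run_time(monday_morning, urgent=True))
```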
Cost monitoring and alerts
Budget alerts
notifications:
  channels:
    - type: slack
      webhook_url: "https://hooks.slack.com/..."
  alerts:
    - event: budget.threshold
      threshold_pct: 80
    - event: budget.exceeded
Datadog integration
Export cost metrics to Datadog for dashboards and alerting.
Metrics exported: bernstein.cost.total, bernstein.cost.per_task, bernstein.tokens.input, bernstein.tokens.output.
Cost comparison: Bernstein vs. manual
For a typical 50-task feature:
| Approach | Time | API cost | Total cost (incl. engineer time) |
|---|---|---|---|
| Manual (engineer + Copilot) | 40 hours | ~$5 | ~$4000+ |
| Bernstein (mixed models) | 2-4 hours | ~$20-40 | ~$220-440 |
| Bernstein (all opus) | 2-4 hours | ~$60-100 | ~$260-500 |
The cost savings come from parallelization and reduced engineer time, not cheaper API calls. Optimize for elapsed time first, API cost second.
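The totals in the comparison are engineer time plus API spend. As a worked sketch, assuming an engineer rate of $100 per hour (inferred from the manual row, where 40 hours maps to roughly $4000; the rate itself is not stated in the guide):

```python
HOURLY_RATE_USD = 100.0  # assumption inferred from the manual row


def total_cost(hours: float, api_cost_usd: float,
               hourly_rate_usd: float = HOURLY_RATE_USD) -> float:
    """Engineer time plus API spend, in USD."""
    return hours * hourly_rate_usd + api_cost_usd


manual = total_cost(40, 5)
mixed_low, mixed_high = total_cost(2, 20), total_cost(4, 40)
opus_low, opus_high = total_cost(2, 60), total_cost(4, 100)

print(f"manual: ${manual:.0f}")
print(f"mixed:  ${mixed_low:.0f}-{mixed_high:.0f}")
print(f"opus:   ${opus_low:.0f}-{opus_high:.0f}")
```

Even the all-opus case stays far below the manual total, because the hourly term dominates: each engineer hour saved is worth more than the whole API bill of the mixed run's low estimate.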