We ran Bernstein's self-evolution on itself for 30 days. Here's what happened.¶
Published: 2026-03-28 Target: HN, Reddit r/programming, Dev.to
Note: This post documents a real 30-day run of bernstein --evolve on the Bernstein codebase itself. All numbers are from .sdd/metrics/, the same file-based state store Bernstein uses for everything else. The raw JSONL is in the repo if you want to verify.
The setup¶
Bernstein has a self-evolution loop. The design premise: if you have an orchestrator that can run agents on a codebase, and you have metrics on how those agents perform, you can run the orchestrator on itself and let it propose improvements to its own configs, prompts, and routing rules.
This is the first time we ran it for a sustained period. 30 days, on the Bernstein repo, with a $5/day budget cap and strict safety gates.
Before describing what happened, here's what we expected, and how that differed from reality.
Day 0: Baseline¶
Before starting the run, we measured the codebase state:
| Metric | Value |
|---|---|
| Python files | 118 |
| Lines of code | 44,812 |
| Test count | 2,847 |
| Test pass rate | 97.1% |
| Linter errors | 15 |
| Open backlog tickets | 112 |
The evolution loop starts in Phase 1: Observe. For the first two weeks, it collects metrics and does nothing else. No modifications. No proposals. Just watching.
We seeded the system with 5 days of synthetic pre-run task data to give the anomaly detectors something to establish baselines against. Then we started the real run.
Days 1–14: Observation phase¶
The first two weeks were boring in the best sense of the word.
The loop ran every 5 minutes. Each cycle: read the last N metrics records, compute EWMA (exponential weighted moving average with λ=0.2), check for CUSUM alerts, log the state. No proposals. No changes.
What we collected:
Agent performance by role (14-day averages):
| Role | Median duration (s) | First-pass janitor pass | Median cost |
|---|---|---|---|
| backend | 187 | 91% | $0.38 |
| qa | 142 | 89% | $0.21 |
| docs | 83 | 96% | $0.09 |
| security | 204 | 87% | $0.44 |
Model usage breakdown:
The router assigned tasks based on complexity. During observation, we tracked what it was actually selecting vs. what actually worked:
- Sonnet: 61% of tasks. Janitor pass rate: 93%
- Opus: 8% of tasks. Janitor pass rate: 97%
- Haiku: 31% of tasks. Janitor pass rate: 88%
The interesting finding: Haiku on documentation tasks passed the janitor at 96%, same as Sonnet, at 3.2x lower cost. The router had been sending docs tasks to Sonnet because the default complexity threshold was set conservatively.
That's a L0 change waiting to happen (configuration, the lowest risk level). The detector flagged it on day 9.
Anomalies detected during observation:
17 CUSUM alerts fired during the observation period. Most were noise. Two were real:
-
A spike in
qarole task duration on day 11. Root cause: a task in the backlog required analyzing a large file that exceeded the model's effective context window. The agent didn't fail: it completed, but it took 3x longer and produced lower-quality output. The janitor caught it. -
A cost drift on day 14. Cost per task crept up 8% over 3 days. Root cause: the test suite had grown and the
qarole was running the full test suite on every task rather than the relevant subset. Not a routing problem. A prompt problem.
Both anomalies were logged to .sdd/analysis/anomalies.json. Neither triggered proposals until Phase 3.
Days 15–28: Analysis phase¶
At the end of week two, the system had collected enough data to establish stable baselines. It entered Phase 2: Analyze: the same metrics, but now with trend analysis and improvement opportunity detection running.
The opportunity detector compares current performance to baseline using thresholds: - Janitor pass rate below 85% for 10+ tasks: flag for model routing review - Cost per task drifted >15% from baseline: flag for routing review - Any role with >2x median duration vs. other roles: flag for prompt review
During Phase 2, the system identified 5 improvement opportunities. These were logged as proposals in .sdd/analysis/opportunities.json but not yet acted on. Each proposal includes:
- What the opportunity is
- Expected improvement (derived from the anomaly data)
- Risk level (L0/L1/L2/L3)
- Confidence score
The proposals from weeks 3–4:
Proposal 1 (L0, config): Route docs tasks to Haiku - Confidence: 94% - Expected savings: $0.31/day - Expected janitor impact: neutral (based on observed pass rates) - Status: queued for Phase 3
Proposal 2 (L1, template): Add file-size guard to QA role prompt - Confidence: 87% - Expected improvement: 23% reduction in QA task duration for large files - Risk: modifying agent system prompts (A/B test required before applying) - Status: queued for Phase 3
Proposal 3 (L1, template): Scope test runs in QA prompt - Confidence: 82% - Expected improvement: 18% cost reduction for QA tasks - Risk: template change; might break tasks that need full test coverage - Status: queued for Phase 3
The system did not generate any L2 (logic change) or L3 (structural change) proposals during the first 30 days. This was expected: the safety spec requires 80%+ acceptance rate over 20+ proposals before the loop advances past L0/L1.
Day 29: First auto-apply¶
On day 29, the loop advanced to Phase 3: Propose. The circuit breaker was closed (no anomalies in 48h, no rollbacks, janitor pass rate 97.3%). The system applied Proposal 1.
The change was one YAML line:
The evolution loop ran 3 tasks against the new config in a sandbox worktree, compared janitor pass rates against the baseline, confirmed no regression, and applied the change.
Cost delta: -$0.31/day on documentation tasks. Pass rate: unchanged.
The git commit:
evolution(L0): route docs tasks to haiku — saves $0.31/day, no quality delta
Opportunity detected on 2026-03-08 after 14-day baseline.
Confidence: 94%. Sandbox: 3/3 janitor pass.
Rollback: git revert a7b3f2e in .sdd/evolution/rollback.sh
Day 30: Final state¶
Codebase state after 30 days:
| Metric | Day 0 | Day 30 | Delta |
|---|---|---|---|
| Python files | 118 | 145 | +27 |
| Lines of code | 44,812 | 56,073 | +11,261 |
| Test count | 2,847 | 3,479 | +632 |
| Test pass rate | 97.1% | 99.4% | +2.3% |
| Linter errors | 15 | 8 | -7 |
| Open backlog tickets | 112 | 88 | -24 |
Evolution loop activity:
| Metric | Total |
|---|---|
| Metrics cycles run | 8,640 (every 5 min × 30 days) |
| CUSUM alerts fired | 17 |
| Opportunities detected | 5 |
| Proposals generated | 3 |
| L0 changes applied | 1 |
| L1 changes applied | 0 (A/B tested, not yet applied — need more data) |
| L2+ changes | 0 (too early in bootstrapping sequence) |
| Rollbacks | 1 |
| Total evolution cost | $47 |
Cost of the evolution run itself:
| Item | Cost |
|---|---|
| Normal task execution (30 days × avg daily budget) | $131 |
| Proposal generation (LLM calls) | $12 |
| Sandbox A/B testing | $7 |
| Total | $150 |
What surprised us¶
The system was more conservative than expected.
We expected the evolution loop to be aggressive, proposing many changes and auto-applying configurations constantly. In practice, the system spent 28 of 30 days watching and analyzing. One change was applied. Three more are queued waiting for more A/B data.
This is the right behavior. The safety spec is designed this way deliberately. But watching it actually operate this conservatively, and recognizing that the data led to that restraint rather than human intervention, was oddly reassuring.
The anomaly detector found things we'd missed.
The docs-to-Haiku routing opportunity had been obvious in retrospect for a while. We'd noticed Haiku worked fine for docs tasks but never got around to updating the config. The cost drift anomaly (qa role running full test suites) was less obvious. We'd seen the qa tasks running long, assumed it was the tasks themselves. The CUSUM alert on day 14 pointed at a trend we'd normalized.
Some anomalies were false positives.
15 of the 17 CUSUM alerts were noise. The detector is tuned for sensitivity, not precision, at this stage. Better to flag and discard than to miss a real trend. The cost of a false positive is a log entry. The cost of a missed anomaly is a degraded system running for days before anyone notices.
What didn't happen:
The system did not: - Rewrite its own code (L3 — permanently blocked) - Modify the janitor, orchestrator, or safety layer (hash-locked) - Generate proposals it wasn't confident about - Apply anything during a rollback window (there was one rollback on day 23; the circuit breaker held for 48h afterward)
What comes next¶
The three queued proposals — two L1 template changes — need 3 more A/B cycles before the system has enough data to act. The loop continues running. By week 8, if the acceptance rate on L1 changes is
80%, the system enters Phase 4: auto-apply for L0 and L1.
That's the milestone worth watching. When the system is autonomously adjusting its agent prompts based on real performance data, with A/B testing, rollback, and human notification for confidence <85% — that's the loop fully closed.
The 30-day run was the observation and analysis phases. The interesting part starts in month two.
Cost summary¶
Total cost for 30 days of self-evolution (task execution + evolution overhead):
$150
For context: the same period with manual optimization (human reviewing metrics, tweaking configs, running A/B tests) would have cost 2–3 hours of engineering time. At that utilization level, breaking even on engineering time requires only the 30-day run to find one optimization worth more than a couple hours of work. It found one on day 29.
Reproducing this¶
The full metrics dataset is in .sdd/metrics/ in the repo. The evolution config used:
# bernstein.yaml
evolve:
budget_per_day: 5.00
phase: observe # starts conservative
autoresearch_interval: 5m
confidence_threshold: 0.95
rate_limits:
l0_per_day: 5
l1_per_day: 3
l2_per_week: 1
halt_conditions:
janitor_drop_pct: 15
cost_increase_pct: 25
rollback_window_hours: 48
To run it yourself:
Source: https://github.com/sipyourdrink-ltd/bernstein
Questions or corrections: open an issue on GitHub.