ADR-003: Self-Evolution Feedback Loop Architecture
Status: Approved
Date: 2026-03-22
Author: Bernstein Architecture Team
Context: Self-improving multi-agent orchestration system
Problem Statement
Bernstein orchestrates multiple LLM agents working on software development tasks. Without a feedback mechanism:
- Performance degradation goes undetected — Agent success rates drop, costs increase, but no automatic correction occurs
- Improvement opportunities are missed — Better model routing, policy adjustments, and prompt optimizations require manual analysis
- System stagnation — The system cannot adapt to changing conditions (new providers, API changes, project evolution)
- Reactive rather than proactive — Humans must notice problems and manually fix them
Requirements
- Automatic metrics collection — Track task success, cost, latency, token usage, provider health
- Performance analysis — Detect trends, anomalies, and bottlenecks
- Upgrade decision logic — Determine when and how to improve the system
- Safe execution — Apply changes with rollback capability
- Continuous operation — Run in background without human intervention
Architecture
graph LR
Metrics["Metrics\nCollection"] --> Analysis["Analysis\nEngine"] --> Upgrade["Upgrade\nDecision"]
Upgrade --> Exec["Execution\nEngine"]
Exec --> Store["State Store\n(.sdd/metrics)"]
Exec --> Metrics
Component 1: Metrics Collection
graph TD
subgraph Metrics Collection
TM["Task Metrics"] & AM["Agent Metrics"] & CM["Cost Metrics"]
TM & AM & CM --> Agg["Metrics Aggregator"]
Agg --> TS["Time-Series Storage (.sdd)"]
end
Task Metrics:
- task_duration_seconds: Time from spawn to completion
- task_success_rate: Percentage passing janitor verification
- task_rework_rate: Percentage requiring fix tasks
- task_token_usage: Total tokens consumed
- task_cost_usd: Dollar cost per task
- files_modified: Number of files changed
- lines_added_deleted: Code churn metrics
Agent Metrics:
- agent_lifetime_seconds: Session duration
- agent_tasks_completed: Tasks per session
- agent_heartbeat_failures: Times heartbeat was missed
- agent_sleep_incidents: Times agent stopped responding
- agent_context_tokens: Context window utilization
Cost Metrics:
- cost_per_provider: USD spent per LLM provider
- cost_per_role: USD spent per agent role
- cost_per_task: Average cost per completed task
- free_tier_utilization: Percentage using free tiers
- budget_remaining: Remaining budget for billing period
Quality Metrics:
- janitor_pass_rate: First-pass verification success
- human_approval_rate: Percentage accepted without review
- rollback_rate: Percentage of changes reverted
- test_pass_rate: Automated test success rate
Provider Health Metrics:
- provider_status: healthy/degraded/unhealthy/rate_limited
- provider_latency_ms: Average response time
- provider_error_rate: Percentage of failed requests
- quota_remaining: Free tier quota left
Component 2: Analysis Engine
graph TD
subgraph Analysis Engine
TD_["Trend Detector\n(7-day trends)"] & AD["Anomaly Detector\n(outliers)"]
TD_ & AD --> RCA["Root Cause Analyzer\nCorrelation · Bottleneck · Cost drivers"]
RCA --> IO["Improvement Opportunities\nRouting · Providers · Policy · Templates"]
end
Analysis Algorithms:
- Trend Detection
  - Rolling average comparison (current vs 7-day baseline)
  - Linear regression for cost/performance trends
  - Change-point detection for sudden shifts
- Anomaly Detection
  - Z-score based outlier detection (threshold: |z| > 2.5)
  - Isolation forest for multivariate anomalies
  - Threshold-based alerts (e.g., cost spike > 50%)
- Correlation Analysis
  - Pearson correlation between metrics
  - Identifies relationships (e.g., model choice → success rate)
  - Surfaces hidden dependencies
- Bottleneck Identification
  - Queue depth analysis per role
  - Agent utilization rates
  - Task completion rate by complexity
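The rolling-baseline comparison and the z-score check listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function names `baseline_shift` and `zscore_outliers` are hypothetical, and only the 2.5 threshold comes from this document.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 2.5  # |z| threshold quoted in the anomaly-detection list


def baseline_shift(baseline: list[float], current: list[float]) -> float:
    """Relative change of the current window's mean versus the baseline mean."""
    base = mean(baseline)
    if base == 0:
        return 0.0  # avoid dividing by zero when the baseline is empty of signal
    return (mean(current) - base) / base


def zscore_outliers(values: list[float], threshold: float = Z_THRESHOLD) -> list[int]:
    """Indices of values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]


if __name__ == "__main__":
    durations = [1.0] * 20 + [10.0]      # made-up sample: one slow task
    print(zscore_outliers(durations))     # index of the slow task
    print(baseline_shift([0.004] * 7, [0.006]))  # relative cost change
```

With a population standard deviation, the largest possible |z| in a sample of n points is sqrt(n - 1), so a 2.5 threshold can only fire once at least eight samples are available. A real detector would likely enforce a minimum window size for that reason.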
Component 3: Upgrade Decision Logic
graph TD
subgraph Upgrade Decision Logic
Trig["Trigger Conditions\nCost spike · Success drop · Degradation\nNew provider · Scheduled review"]
Trig --> Cat["Upgrade Categories"]
Cat --> PU["Policy Update"] & RR["Routing Rules"]
Cat --> MR["Model Routing"] & RT["Role Templates"]
PU & RR & MR & RT --> DC["Decision Criteria\nImprovement > threshold · Risk acceptable\nCost < savings · No conflicts"]
end
Trigger Conditions:
| Trigger | Threshold | Action |
|---|---|---|
| Cost spike | >50% increase in 24h | Immediate review |
| Success rate drop | <80% for 10+ tasks | Model routing adjustment |
| Free tier available | New provider detected | Policy update |
| Budget threshold | >80% of monthly budget | Cost optimization |
| Scheduled review | Weekly/Monthly | Full system analysis |
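The numeric rows of the trigger table can be expressed as a small pure function. The sketch below is an assumption about how such a check might look: the `Snapshot` type, its field names, and the trigger labels are hypothetical, and "<80% for 10+ tasks" is read here as a success rate below 80% across a window of at least 10 tasks.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical rollup of recent metrics; field names are illustrative."""
    cost_last_24h: float
    cost_prev_24h: float
    recent_task_results: list[bool]  # janitor pass/fail, newest last
    budget_used_fraction: float      # 0.0 to 1.0 of the monthly budget


def fired_triggers(s: Snapshot) -> list[str]:
    """Return the names of the trigger conditions that currently hold."""
    fired = []
    # Cost spike: more than a 50% increase over the previous 24h window.
    if s.cost_prev_24h > 0:
        increase = (s.cost_last_24h - s.cost_prev_24h) / s.cost_prev_24h
        if increase > 0.50:
            fired.append("cost_spike")
    # Success rate drop: below 80% over a window of at least 10 tasks.
    if len(s.recent_task_results) >= 10:
        rate = sum(s.recent_task_results) / len(s.recent_task_results)
        if rate < 0.80:
            fired.append("success_rate_drop")
    # Budget threshold: more than 80% of the monthly budget consumed.
    if s.budget_used_fraction > 0.80:
        fired.append("budget_threshold")
    return fired
```

The provider-detection and scheduled-review triggers are event-driven rather than numeric, so they are left out of this sketch.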
Upgrade Categories:
- Policy Updates (Low Risk)
  - Adjust provider switching thresholds
  - Modify batch sizes
  - Update rate limit configurations
- Routing Rules (Medium Risk)
  - Change model selection criteria
  - Add/remove provider preferences
  - Adjust effort level mappings
- Model Routing (Medium Risk)
  - Switch default models for roles
  - Update complexity thresholds
  - Add new model providers
- Role Templates (High Risk)
  - Update system prompts
  - Modify task prompt templates
  - Change role configurations
Component 4: Execution Engine
graph LR
subgraph Execution Engine
V["Validate\nChange"] --> A["Apply\nChange"] --> Ve["Verify\nChange"]
Ve -->|fail| Al["Alert"] --> Mon["Monitor\nResults"] --> Rb["Rollback\nif needed"]
end
Execution Flow:
- Validation
  - Syntax check for YAML/JSON policy changes
  - Dry-run simulation for routing changes
  - Backward compatibility verification
- Application
  - Atomic file writes with rollback capability
  - Version control integration (git commit per change)
  - Notification to running agents
- Verification
  - Immediate metric check (did things improve?)
  - A/B comparison with baseline
  - Rollback trigger if degradation detected
- Monitoring
  - Watch key metrics for 24h post-change
  - Alert on unexpected side effects
  - Log all changes for audit trail
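The "atomic file writes with rollback capability" step above could look roughly like the following. This is a sketch under stated assumptions: `apply_with_rollback` and its `verify` callback are hypothetical names, and the validation, git commit, and agent notification steps are deliberately omitted.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def apply_with_rollback(path: Path, new_text: str, verify: Callable[[], bool]) -> bool:
    """Atomically replace `path`, then restore the old content if `verify` fails."""
    backup = path.with_name(path.name + ".bak")
    had_original = path.exists()
    if had_original:
        shutil.copy2(path, backup)  # keep the previous version for rollback

    # Write to a temp file in the same directory so os.replace stays atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(new_text)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)

    if verify():
        if had_original:
            backup.unlink()  # change accepted, backup no longer needed
        return True

    # Verification failed: put the previous state back.
    if had_original:
        os.replace(backup, path)
    else:
        path.unlink()
    return False
```

Writing the temporary file next to the target matters because `os.replace` is only atomic when source and destination are on the same filesystem.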
Data Flow
Task Completion
       │
       ▼
┌──────────────┐
│  Janitor     │───▶ Pass/Fail + Metrics
└──────────────┘
       │
       ▼
┌──────────────┐
│  Metrics     │───▶ Append to .sdd/metrics/tasks.jsonl
│  Collector   │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Analysis    │───▶ Run every N tasks or T minutes
│  Scheduler   │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Analysis    │───▶ Identify patterns
│  Engine      │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Upgrade     │───▶ Decide on changes
│  Decision    │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Execution   │───▶ Apply + Verify
│  Engine      │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Git Commit  │───▶ Track changes
│  + Notify    │
└──────────────┘
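The Analysis Scheduler's "run every N tasks or T minutes" rule can be sketched as a small counter-plus-timer. The class and method names below are hypothetical, and the injectable clock exists only to make the behavior easy to test.

```python
import time
from typing import Callable


class AnalysisScheduler:
    """Signals that an analysis run is due after N tasks or T minutes."""

    def __init__(self, every_n_tasks: int, every_t_minutes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.every_n_tasks = every_n_tasks
        self.interval_s = every_t_minutes * 60
        self._clock = clock
        self._tasks_since_run = 0
        self._last_run = clock()

    def record_task(self) -> None:
        """Call once per completed task."""
        self._tasks_since_run += 1

    def due(self) -> bool:
        """True when either the task count or the time interval is reached."""
        elapsed = self._clock() - self._last_run
        return self._tasks_since_run >= self.every_n_tasks or elapsed >= self.interval_s

    def mark_run(self) -> None:
        """Reset both counters after an analysis run."""
        self._tasks_since_run = 0
        self._last_run = self._clock()
```

A caller would invoke `record_task()` from the metrics collector, poll `due()`, and call `mark_run()` once the Analysis Engine has finished.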
State Storage
All state lives in the .sdd/ directory:
.sdd/
├── metrics/
│   ├── tasks.jsonl        # Per-task metrics (append-only)
│   ├── agents.jsonl       # Per-agent session metrics
│   ├── costs.jsonl        # Cost tracking per provider
│   └── quality.jsonl      # Quality metrics (janitor, tests)
├── analysis/
│   ├── trends.json        # 7-day rolling trends
│   ├── anomalies.json     # Detected anomalies
│   └── opportunities.json # Improvement suggestions
├── upgrades/
│   ├── pending.json       # Upgrades awaiting approval
│   ├── applied.json       # Recently applied upgrades
│   └── history.jsonl      # Full upgrade history
└── config/
    ├── policies.yaml      # Active policies
    ├── routing.yaml       # Model routing rules
    └── providers.yaml     # Provider configurations
Metrics Schema
Task Metrics Record:
{
  "timestamp": "2026-03-22T10:30:00Z",
  "task_id": "PROJ-042",
  "role": "backend",
  "model": "sonnet",
  "provider": "openrouter",
  "duration_seconds": 180,
  "tokens_prompt": 2500,
  "tokens_completion": 1200,
  "cost_usd": 0.0045,
  "janitor_passed": true,
  "files_modified": 3,
  "lines_added": 45,
  "lines_deleted": 12
}
Provider Cost Record:
{
  "timestamp": "2026-03-22T10:30:00Z",
  "provider": "openrouter",
  "model": "sonnet",
  "tier": "paid",
  "tokens_in": 2500,
  "tokens_out": 1200,
  "cost_usd": 0.0045,
  "rate_limit_remaining": 950,
  "free_tier_remaining": 0
}
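Records like the two above are stored one JSON object per line in the append-only .jsonl files. The helpers below are a hedged sketch of that pattern, not the project's metrics module: `append_record`, `read_records`, and `pass_rate` are hypothetical names, and only the `janitor_passed` field is taken from the schema.

```python
import json
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_records(path: Path) -> list[dict]:
    """Load every non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def pass_rate(records: list[dict]) -> float:
    """Fraction of task records whose janitor_passed field is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("janitor_passed")) / len(records)
```

For example, `pass_rate(read_records(Path(".sdd/metrics/tasks.jsonl")))` would give a first-pass verification rate over all recorded tasks, assuming the file exists.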
Upgrade Approval Modes
| Mode | Description | Use Case |
|---|---|---|
| Auto | Apply immediately | Low-risk policy tweaks |
| Human | Require approval | High-risk template changes |
| Hybrid | Auto if confidence >90%, else human | Most upgrades |
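The three modes reduce to one routing decision per proposed upgrade. The following is a minimal sketch of that decision; `ApprovalMode` and `needs_human_approval` are hypothetical names, and the 0.90 cutoff is the Hybrid threshold from the table.

```python
from enum import Enum


class ApprovalMode(Enum):
    AUTO = "auto"
    HUMAN = "human"
    HYBRID = "hybrid"


HYBRID_CONFIDENCE = 0.90  # "Auto if confidence >90%" from the Hybrid row


def needs_human_approval(mode: ApprovalMode, confidence: float) -> bool:
    """Decide whether a proposed upgrade must wait for a human."""
    if mode is ApprovalMode.AUTO:
        return False
    if mode is ApprovalMode.HUMAN:
        return True
    # Hybrid: auto-apply only when confidence is strictly above the cutoff.
    return not confidence > HYBRID_CONFIDENCE
```

Under this reading, a Hybrid upgrade with confidence exactly 0.90 still goes to a human, since the table specifies a strict ">90%".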
Implementation: EvolutionCoordinator
The EvolutionCoordinator class in src/bernstein/core/evolution.py implements this architecture:
coordinator = EvolutionCoordinator(
    router=tier_aware_router,
    hijacker=tier_hijacker,
    metrics_collector=metrics_collector,
    config=EvolutionConfig(
        evaluation_interval_minutes=30,
        min_tasks_for_evaluation=5,
        auto_execute_low_priority=False,
    ),
)
coordinator.start()  # Background evaluation loop
Key responsibilities:
1. Periodic performance evaluation (every 30 minutes)
2. Metrics aggregation and trend analysis
3. Upgrade recommendation generation
4. Task creation for implementation
5. History tracking for impact measurement
Alternatives Considered
Option A: Manual-only improvements
Humans analyze metrics and manually apply fixes.
Pros: Full control, no automation risk
Cons: Slow, reactive, requires constant human attention
Verdict: Rejected. Defeats the purpose of self-evolution.
Option B: Full auto-pilot
System makes and applies all changes automatically.
Pros: Maximum automation, rapid iteration
Cons: Risk of cascading errors, hard to debug
Verdict: Rejected for high-risk changes. Accepted for low-risk policy tweaks.
Option C: Hybrid (chosen)
Automatic analysis + human approval for high-risk changes.
Pros: Best balance of automation and control
Cons: Requires human involvement for major changes
Verdict: Selected. Low-risk changes (policy tweaks) auto-apply. High-risk changes (prompt modifications) require approval.
Consequences
Positive
- Continuous improvement — System gets better over time without manual intervention
- Early problem detection — Anomalies caught before they become critical
- Cost optimization — Automatic switching to cheaper providers when possible
- Performance tuning — Router thresholds adjust based on real data
Risks
- Over-optimization — System might optimize for metrics at expense of quality
- Change fatigue — Too many automatic changes could destabilize development
- Debugging complexity — Harder to trace why a change was made
Mitigations
- Confidence thresholds — Only apply changes with >80% confidence
- Rate limiting — Maximum 2 concurrent upgrades
- Audit trail — All changes logged with rationale
- Rollback capability — Automatic revert if metrics degrade
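The confidence threshold and the cap on concurrent upgrades can be combined into a single admission gate. This is an illustrative sketch only: `UpgradeGate` is a hypothetical class, and the two constants mirror the >80% and "maximum 2" figures in the list above.

```python
import threading

MIN_CONFIDENCE = 0.80          # "Only apply changes with >80% confidence"
MAX_CONCURRENT_UPGRADES = 2    # "Maximum 2 concurrent upgrades"


class UpgradeGate:
    """Admits an upgrade only if it is confident enough and a slot is free."""

    def __init__(self, max_concurrent: int = MAX_CONCURRENT_UPGRADES,
                 min_confidence: float = MIN_CONFIDENCE) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._min_confidence = min_confidence

    def try_start(self, confidence: float) -> bool:
        """Reserve a slot without blocking; False means the upgrade must wait."""
        if confidence <= self._min_confidence:
            return False
        return self._slots.acquire(blocking=False)

    def finish(self) -> None:
        """Release a slot reserved by a successful try_start."""
        self._slots.release()
```

A bounded semaphore is used so that an unmatched `finish()` raises `ValueError` instead of silently raising the concurrency limit.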
References
- Implementation: src/bernstein/core/evolution.py
- Metrics: src/bernstein/core/metrics.py
- Policy Engine: src/bernstein/core/policy.py