
ADR-003: Self-Evolution Feedback Loop Architecture

Status: Approved
Date: 2026-03-22
Author: Bernstein Architecture Team
Context: Self-improving multi-agent orchestration system


Problem Statement

Bernstein orchestrates multiple LLM agents working on software development tasks. Without a feedback mechanism:

  1. Performance degradation goes undetected — Agent success rates drop, costs increase, but no automatic correction occurs
  2. Improvement opportunities are missed — Better model routing, policy adjustments, and prompt optimizations require manual analysis
  3. System stagnation — The system cannot adapt to changing conditions (new providers, API changes, project evolution)
  4. Reactive rather than proactive — Humans must notice problems and manually fix them

Requirements

  1. Automatic metrics collection — Track task success, cost, latency, token usage, provider health
  2. Performance analysis — Detect trends, anomalies, and bottlenecks
  3. Upgrade decision logic — Determine when and how to improve the system
  4. Safe execution — Apply changes with rollback capability
  5. Continuous operation — Run in background without human intervention

Architecture

graph LR
    Metrics["Metrics\nCollection"] --> Analysis["Analysis\nEngine"] --> Upgrade["Upgrade\nDecision"]
    Upgrade --> Exec["Execution\nEngine"]
    Exec --> Store["State Store\n(.sdd/metrics)"]
    Exec --> Metrics

Component 1: Metrics Collection

graph TD
    subgraph Metrics Collection
        TM["Task Metrics"] & AM["Agent Metrics"] & CM["Cost Metrics"]
        TM & AM & CM --> Agg["Metrics Aggregator"]
        Agg --> TS["Time-Series Storage (.sdd)"]
    end

Task Metrics:

  • task_duration_seconds: Time from spawn to completion
  • task_success_rate: Percentage passing janitor verification
  • task_rework_rate: Percentage requiring fix tasks
  • task_token_usage: Total tokens consumed
  • task_cost_usd: Dollar cost per task
  • files_modified: Number of files changed
  • lines_added_deleted: Code churn metrics

Agent Metrics:

  • agent_lifetime_seconds: Session duration
  • agent_tasks_completed: Tasks per session
  • agent_heartbeat_failures: Times heartbeat was missed
  • agent_sleep_incidents: Times agent stopped responding
  • agent_context_tokens: Context window utilization

Cost Metrics:

  • cost_per_provider: USD spent per LLM provider
  • cost_per_role: USD spent per agent role
  • cost_per_task: Average cost per completed task
  • free_tier_utilization: Percentage using free tiers
  • budget_remaining: Remaining budget for billing period

Quality Metrics:

  • janitor_pass_rate: First-pass verification success
  • human_approval_rate: Percentage accepted without review
  • rollback_rate: Percentage of changes reverted
  • test_pass_rate: Automated test success rate

Provider Health Metrics:

  • provider_status: healthy/degraded/unhealthy/rate_limited
  • provider_latency_ms: Average response time
  • provider_error_rate: Percentage of failed requests
  • quota_remaining: Free tier quota left

Component 2: Analysis Engine

graph TD
    subgraph Analysis Engine
        TD_["Trend Detector\n(7-day trends)"] & AD["Anomaly Detector\n(outliers)"]
        TD_ & AD --> RCA["Root Cause Analyzer\nCorrelation · Bottleneck · Cost drivers"]
        RCA --> IO["Improvement Opportunities\nRouting · Providers · Policy · Templates"]
    end

Analysis Algorithms:

  1. Trend Detection
     • Rolling average comparison (current vs 7-day baseline)
     • Linear regression for cost/performance trends
     • Change-point detection for sudden shifts

  2. Anomaly Detection
     • Z-score based outlier detection (threshold: |z| > 2.5)
     • Isolation forest for multi-variate anomalies
     • Threshold-based alerts (e.g., cost spike > 50%)

  3. Correlation Analysis
     • Pearson correlation between metrics
     • Identifies relationships (e.g., model choice → success rate)
     • Surfaces hidden dependencies

  4. Bottleneck Identification
     • Queue depth analysis per role
     • Agent utilization rates
     • Task completion rate by complexity
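Three of the checks listed under Analysis Algorithms (baseline comparison, z-score outliers, Pearson correlation) can be sketched with the standard library alone. This is a minimal illustration, not the Analysis Engine's actual code: the function names and sample data are hypothetical, and only the |z| > 2.5 threshold and the 7-day baseline come from the text.

```python
from math import sqrt
from statistics import mean, pstdev


def baseline_change(history: list[float], current: float) -> float:
    """Relative change of `current` versus the mean of `history` (e.g. 7 daily values)."""
    baseline = mean(history)
    return 0.0 if baseline == 0 else (current - baseline) / baseline


def zscore_outliers(values: list[float], threshold: float = 2.5) -> list[int]:
    """Indices of values whose |z| exceeds `threshold` (2.5 per the text)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant series; this sketch reports 0
    return cov / sqrt(var_x * var_y)


daily_cost = [2.0] * 7                      # hypothetical 7-day baseline
print(baseline_change(daily_cost, 3.0))     # 0.5, i.e. a 50% increase
latencies = [10.0] * 9 + [100.0]            # hypothetical samples with one spike
print(zscore_outliers(latencies))           # [9]
```

With a population standard deviation, no single point in a sample of n values can have |z| above sqrt(n - 1), so a 2.5 threshold can only fire once at least 8 samples are available.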

Component 3: Upgrade Decision Logic

graph TD
    subgraph Upgrade Decision Logic
        Trig["Trigger Conditions\nCost spike · Success drop · Degradation\nNew provider · Scheduled review"]
        Trig --> Cat["Upgrade Categories"]
        Cat --> PU["Policy Update"] & RR["Routing Rules"]
        Cat --> MR["Model Routing"] & RT["Role Templates"]
        PU & RR & MR & RT --> DC["Decision Criteria\nImprovement > threshold · Risk acceptable\nCost < savings · No conflicts"]
    end

Trigger Conditions:

| Trigger | Threshold | Action |
|---|---|---|
| Cost spike | >50% increase in 24h | Immediate review |
| Success rate drop | <80% for 10+ tasks | Model routing adjustment |
| Free tier available | New provider detected | Policy update |
| Budget threshold | >80% of monthly budget | Cost optimization |
| Scheduled review | Weekly/Monthly | Full system analysis |
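The threshold-based triggers can be expressed as a single function over a metrics snapshot. The sketch below is illustrative only: `Snapshot`, its field names, and the trigger and action strings are hypothetical. It also assumes "<80% for 10+ tasks" means a success rate below 80% measured over at least 10 recent tasks. The time-based scheduled review is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical roll-up of recent metrics; field names are illustrative."""
    cost_last_24h_usd: float
    cost_prev_24h_usd: float
    month_spend_usd: float
    monthly_budget_usd: float
    recent_results: list[bool] = field(default_factory=list)  # janitor pass/fail per task
    new_provider_detected: bool = False


def fired_triggers(s: Snapshot) -> list[tuple[str, str]]:
    """Return (trigger, action) pairs for every threshold that is crossed."""
    fired: list[tuple[str, str]] = []

    # Cost spike: more than a 50% increase over the previous 24h window.
    if s.cost_prev_24h_usd > 0:
        increase = (s.cost_last_24h_usd - s.cost_prev_24h_usd) / s.cost_prev_24h_usd
        if increase > 0.50:
            fired.append(("cost_spike", "immediate_review"))

    # Success rate drop: below 80%, evaluated only once 10+ tasks are available.
    if len(s.recent_results) >= 10:
        success_rate = sum(s.recent_results) / len(s.recent_results)
        if success_rate < 0.80:
            fired.append(("success_rate_drop", "model_routing_adjustment"))

    # Free tier available: a new provider was detected.
    if s.new_provider_detected:
        fired.append(("free_tier_available", "policy_update"))

    # Budget threshold: more than 80% of the monthly budget is spent.
    if s.monthly_budget_usd > 0 and s.month_spend_usd / s.monthly_budget_usd > 0.80:
        fired.append(("budget_threshold", "cost_optimization"))

    return fired
```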

Upgrade Categories:

  1. Policy Updates (Low Risk)
     • Adjust provider switching thresholds
     • Modify batch sizes
     • Update rate limit configurations

  2. Routing Rules (Medium Risk)
     • Change model selection criteria
     • Add/remove provider preferences
     • Adjust effort level mappings

  3. Model Routing (Medium Risk)
     • Switch default models for roles
     • Update complexity thresholds
     • Add new model providers

  4. Role Templates (High Risk)
     • Update system prompts
     • Modify task prompt templates
     • Change role configurations

Component 4: Execution Engine

graph LR
    subgraph Execution Engine
        V["Validate\nChange"] --> A["Apply\nChange"] --> Ve["Verify\nChange"]
        Ve -->|fail| Al["Alert"] --> Mon["Monitor\nResults"] --> Rb["Rollback\nif needed"]
    end

Execution Flow:

  1. Validation
     • Syntax check for YAML/JSON policy changes
     • Dry-run simulation for routing changes
     • Backward compatibility verification

  2. Application
     • Atomic file writes with rollback capability
     • Version control integration (git commit per change)
     • Notification to running agents

  3. Verification
     • Immediate metric check (did things improve?)
     • A/B comparison with baseline
     • Rollback trigger if degradation detected

  4. Monitoring
     • Watch key metrics for 24h post-change
     • Alert on unexpected side effects
     • Log all changes for audit trail
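The validate, apply, verify, and roll back steps can be sketched for a single config file as follows. This is a minimal illustration under stated assumptions. The function names are hypothetical, a JSON file stands in for the YAML policies so the example needs only the standard library, and the `verify` callback stands in for the metric check. Git commits and agent notification are omitted.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def apply_with_rollback(path: Path, new_text: str, verify: Callable[[], bool]) -> bool:
    """Validate, apply, verify; restore the previous content if verification fails."""
    json.loads(new_text)  # 1. Validation: raises ValueError on malformed JSON
    previous = path.read_text(encoding="utf-8") if path.exists() else None

    atomic_write(path, new_text)  # 2. Application
    if verify():                  # 3. Verification (e.g. a metric check)
        return True

    # Rollback: put the earlier content back, or remove a file that did not exist.
    if previous is None:
        path.unlink()
    else:
        atomic_write(path, previous)
    return False
```

Writing to a temporary file in the same directory and renaming it keeps the change on one filesystem, which is what lets `os.replace` swap the file in a single step.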

Data Flow

Task Completion
       │
       ▼
┌──────────────┐
│  Janitor     │───▶ Pass/Fail + Metrics
└──────────────┘
       │
       ▼
┌──────────────┐
│  Metrics     │───▶ Append to .sdd/metrics/tasks.jsonl
│  Collector   │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Analysis    │───▶ Run every N tasks or T minutes
│  Scheduler   │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Analysis    │───▶ Identify patterns
│  Engine      │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Upgrade     │───▶ Decide on changes
│  Decision    │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Execution   │───▶ Apply + Verify
│  Engine      │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Git Commit  │───▶ Track changes
│  + Notify    │
└──────────────┘
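The "run every N tasks or T minutes" rule of the Analysis Scheduler could look roughly like the sketch below. The class name, method names, and default values are hypothetical. The clock is injectable so the behaviour can be exercised without waiting.

```python
import time
from typing import Callable


class AnalysisScheduler:
    """Signals that analysis is due after N recorded tasks or T elapsed seconds."""

    def __init__(
        self,
        every_n_tasks: int = 5,
        every_t_seconds: float = 30 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._every_n_tasks = every_n_tasks
        self._every_t_seconds = every_t_seconds
        self._clock = clock
        self._tasks_since_run = 0
        self._last_run = clock()

    def record_task(self) -> None:
        """Call once per completed task, after its metrics were appended."""
        self._tasks_since_run += 1

    def due(self) -> bool:
        """True when either the task count or the elapsed time crosses its limit."""
        enough_tasks = self._tasks_since_run >= self._every_n_tasks
        enough_time = self._clock() - self._last_run >= self._every_t_seconds
        return enough_tasks or enough_time

    def mark_run(self) -> None:
        """Reset both counters after an analysis pass has completed."""
        self._tasks_since_run = 0
        self._last_run = self._clock()
```

A caller would invoke `record_task()` from the metrics collector, poll `due()`, run the analysis, and then call `mark_run()`.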

State Storage

All state lives in the .sdd/ directory:

.sdd/
├── metrics/
│   ├── tasks.jsonl          # Per-task metrics (append-only)
│   ├── agents.jsonl         # Per-agent session metrics
│   ├── costs.jsonl          # Cost tracking per provider
│   └── quality.jsonl        # Quality metrics (janitor, tests)
├── analysis/
│   ├── trends.json          # 7-day rolling trends
│   ├── anomalies.json       # Detected anomalies
│   └── opportunities.json   # Improvement suggestions
├── upgrades/
│   ├── pending.json         # Upgrades awaiting approval
│   ├── applied.json         # Recently applied upgrades
│   └── history.jsonl        # Full upgrade history
└── config/
    ├── policies.yaml        # Active policies
    ├── routing.yaml         # Model routing rules
    └── providers.yaml       # Provider configurations

Metrics Schema

Task Metrics Record:

{
  "timestamp": "2026-03-22T10:30:00Z",
  "task_id": "PROJ-042",
  "role": "backend",
  "model": "sonnet",
  "provider": "openrouter",
  "duration_seconds": 180,
  "tokens_prompt": 2500,
  "tokens_completion": 1200,
  "cost_usd": 0.0045,
  "janitor_passed": true,
  "files_modified": 3,
  "lines_added": 45,
  "lines_deleted": 12
}
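A record of this shape can be appended to the append-only tasks.jsonl file one JSON object per line. The following is a minimal sketch: the `TaskMetrics` dataclass and the helper names are hypothetical, and the fields mirror the record above.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class TaskMetrics:
    """Mirrors the Task Metrics Record fields; the class itself is illustrative."""
    timestamp: str
    task_id: str
    role: str
    model: str
    provider: str
    duration_seconds: int
    tokens_prompt: int
    tokens_completion: int
    cost_usd: float
    janitor_passed: bool
    files_modified: int
    lines_added: int
    lines_deleted: int


def append_record(path: Path, record: TaskMetrics) -> None:
    """Append one record as a single JSON line, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def load_records(path: Path) -> list[dict]:
    """Read all records back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening the file in append mode means existing lines are never rewritten, which matches the append-only intent noted in the State Storage layout.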

Provider Cost Record:

{
  "timestamp": "2026-03-22T10:30:00Z",
  "provider": "openrouter",
  "model": "sonnet",
  "tier": "paid",
  "tokens_in": 2500,
  "tokens_out": 1200,
  "cost_usd": 0.0045,
  "rate_limit_remaining": 950,
  "free_tier_remaining": 0
}
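Cost metrics such as cost_per_provider and free_tier_utilization can be derived from these records. The sketch below is an assumption-laden illustration: the function names are hypothetical, and free-tier utilization is assumed to mean the share of records whose tier is "free", which the source does not spell out.

```python
from collections import defaultdict


def cost_per_provider(records: list[dict]) -> dict[str, float]:
    """Sum cost_usd per provider across Provider Cost Records."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["provider"]] += rec["cost_usd"]
    return dict(totals)


def free_tier_utilization(records: list[dict]) -> float:
    """Percentage of records served by a free tier (assumed definition)."""
    if not records:
        return 0.0
    free = sum(1 for rec in records if rec["tier"] == "free")
    return 100.0 * free / len(records)


# Hypothetical sample records; "other" is a placeholder provider name.
sample = [
    {"provider": "openrouter", "tier": "paid", "cost_usd": 0.5},
    {"provider": "openrouter", "tier": "paid", "cost_usd": 0.25},
    {"provider": "other", "tier": "free", "cost_usd": 0.0},
    {"provider": "other", "tier": "free", "cost_usd": 0.0},
]
print(cost_per_provider(sample))      # {'openrouter': 0.75, 'other': 0.0}
print(free_tier_utilization(sample))  # 50.0
```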


Upgrade Approval Modes

| Mode | Description | Use Case |
|---|---|---|
| Auto | Apply immediately | Low-risk policy tweaks |
| Human | Require approval | High-risk template changes |
| Hybrid | Auto if confidence >90%, else human | Most upgrades |
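One way to connect the risk levels of the upgrade categories to these approval modes is sketched below. The identifiers are hypothetical, and mapping medium-risk categories to Hybrid is an assumption drawn from the "Most upgrades" use case. Only the 90% confidence cut-off comes from the table.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Risk levels as listed under Upgrade Categories; the keys are illustrative names.
CATEGORY_RISK = {
    "policy_update": Risk.LOW,
    "routing_rules": Risk.MEDIUM,
    "model_routing": Risk.MEDIUM,
    "role_templates": Risk.HIGH,
}


def approval_mode(category: str) -> str:
    """Map an upgrade category to 'auto', 'human', or 'hybrid'."""
    risk = CATEGORY_RISK[category]
    if risk is Risk.LOW:
        return "auto"
    if risk is Risk.HIGH:
        return "human"
    return "hybrid"  # assumption: medium-risk upgrades use the hybrid mode


def requires_human(category: str, confidence: float) -> bool:
    """True when a person must approve the upgrade before it is applied."""
    mode = approval_mode(category)
    if mode == "auto":
        return False
    if mode == "human":
        return True
    return confidence <= 0.90  # hybrid: auto-apply only above 90% confidence
```

For example, `requires_human("model_routing", 0.95)` returns `False`, while the same call with a confidence of 0.85 returns `True`.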

Implementation: EvolutionCoordinator

The EvolutionCoordinator class in src/bernstein/core/evolution.py implements this architecture:

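# EvolutionCoordinator lives in src/bernstein/core/evolution.py; importing
# EvolutionConfig from the same module is an assumption.
from bernstein.core.evolution import EvolutionConfig, EvolutionCoordinator

# tier_aware_router, tier_hijacker, and metrics_collector are assumed to be
# constructed elsewhere in the application.
coordinator = EvolutionCoordinator(
    router=tier_aware_router,
    hijacker=tier_hijacker,
    metrics_collector=metrics_collector,
    config=EvolutionConfig(
        evaluation_interval_minutes=30,
        min_tasks_for_evaluation=5,
        auto_execute_low_priority=False,
    ),
)

coordinator.start()  # Background evaluation loop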

Key responsibilities:

  1. Periodic performance evaluation (every 30 minutes)
  2. Metrics aggregation and trend analysis
  3. Upgrade recommendation generation
  4. Task creation for implementation
  5. History tracking for impact measurement


Alternatives Considered

Option A: Manual-only improvements

Humans analyze metrics and manually apply fixes.

Pros: Full control, no automation risk
Cons: Slow, reactive, requires constant human attention

Verdict: Rejected. Defeats the purpose of self-evolution.

Option B: Full auto-pilot

System makes and applies all changes automatically.

Pros: Maximum automation, rapid iteration
Cons: Risk of cascading errors, hard to debug

Verdict: Rejected for high-risk changes. Accepted for low-risk policy tweaks.

Option C: Hybrid (chosen)

Automatic analysis + human approval for high-risk changes.

Pros: Best balance of automation and control
Cons: Requires human involvement for major changes

Verdict: Selected. Low-risk changes (policy tweaks) auto-apply. High-risk changes (prompt modifications) require approval.


Consequences

Positive

  • Continuous improvement — System gets better over time without manual intervention
  • Early problem detection — Anomalies caught before they become critical
  • Cost optimization — Automatic switching to cheaper providers when possible
  • Performance tuning — Router thresholds adjust based on real data

Risks

  • Over-optimization — System might optimize for metrics at expense of quality
  • Change fatigue — Too many automatic changes could destabilize development
  • Debugging complexity — Harder to trace why a change was made

Mitigations

  • Confidence thresholds — Only apply changes with >80% confidence
  • Rate limiting — Maximum 2 concurrent upgrades
  • Audit trail — All changes logged with rationale
  • Rollback capability — Automatic revert if metrics degrade
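The confidence threshold and the concurrency limit can be combined into one admission check. This is a sketch only: `UpgradeGate` and its methods are hypothetical names, and the defaults take the >80% confidence and maximum of 2 concurrent upgrades from the list above.

```python
class UpgradeGate:
    """Admits an upgrade only if it is confident enough and a slot is free."""

    def __init__(self, min_confidence: float = 0.80, max_concurrent: int = 2) -> None:
        self._min_confidence = min_confidence
        self._max_concurrent = max_concurrent
        self._in_flight: set[str] = set()

    def try_start(self, upgrade_id: str, confidence: float) -> bool:
        """Reserve a slot for the upgrade; returns False if it must wait or be dropped."""
        if confidence <= self._min_confidence:
            return False  # confidence threshold: must be strictly above 80%
        if upgrade_id in self._in_flight:
            return False  # already running
        if len(self._in_flight) >= self._max_concurrent:
            return False  # rate limit: at most 2 upgrades at a time
        self._in_flight.add(upgrade_id)
        return True

    def finish(self, upgrade_id: str) -> None:
        """Release the slot once the upgrade is verified or rolled back."""
        self._in_flight.discard(upgrade_id)
```

Releasing the slot in `finish()` after either outcome keeps a failed upgrade from blocking later ones.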

References

  • Implementation: src/bernstein/core/evolution.py
  • Metrics: src/bernstein/core/metrics.py
  • Policy Engine: src/bernstein/core/policy.py