Bernstein Design

This document describes Bernstein's architecture as implemented in the codebase today and marks explicit boundaries where a feature is only partially complete.

Core design principles

Short-lived workers: agents are spawned for focused work and then exit.
File-first state: runtime state is persisted under .sdd/.
Deterministic orchestration: scheduling and lifecycle decisions are code-driven.
Verification before closure: task completion passes through janitor/quality logic.
Multi-adapter runtime: Bernstein is CLI-agent agnostic via adapter interfaces.

High-level architecture

CLI (src/bernstein/cli/)
  -> Task server (src/bernstein/core/server.py shim -> core/server/)
    -> Route modules (src/bernstein/core/routes/)
      -> Store + lifecycle + orchestration (core/ sub-packages)
        -> Adapter-based process spawning (adapters/)

Since v1.6, core/ is organized into ~60 sub-packages. Top-level modules like core/server.py, core/orchestrator.py, core/spawner.py, core/task_lifecycle.py, and core/models.py are thin re-export shims that redirect to their sub-packages.
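The re-export shim pattern mentioned above can be sketched as follows. This is a minimal illustration, not the contents of the real shims: the module names `demo_impl` and `demo_shim` and the `Orchestrator` body are hypothetical stand-ins, built in memory so the example runs without any files on disk.

```python
import sys
import types

# Hypothetical "real" implementation module (stands in for a sub-package module).
impl = types.ModuleType("demo_impl")
exec(
    "class Orchestrator:\n"
    "    def tick(self):\n"
    "        return 'tick'\n",
    impl.__dict__,
)
sys.modules["demo_impl"] = impl

# Hypothetical shim: its entire body re-exports names from the implementation.
shim = types.ModuleType("demo_shim")
exec(
    "from demo_impl import Orchestrator\n"
    "__all__ = ['Orchestrator']\n",
    shim.__dict__,
)
sys.modules["demo_shim"] = shim

# The old import path keeps working and resolves to the very same class object.
from demo_shim import Orchestrator as ViaShim
from demo_impl import Orchestrator as Direct

same_object = ViaShim is Direct
```

Because the shim only re-exports, callers that still import from the old top-level path get the same objects as callers using the sub-package path.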

Primary orchestration modules:

src/bernstein/core/orchestrator.py (shim) -> src/bernstein/core/orchestration/orchestrator.py
src/bernstein/core/orchestration/tick_pipeline.py
src/bernstein/core/tasks/task_lifecycle.py
src/bernstein/core/agents/agent_lifecycle.py

Key runtime subsystems (in sub-packages):

Routing/cost: core/routing/router.py, core/routing/cascade_router.py, core/cost/cost.py, core/cost/cost_history.py, core/cost/cost_anomaly.py
Reliability: core/agents/heartbeat.py, core/cost/completion_budget.py, core/observability/loop_detector.py
Verification: core/quality/janitor.py, core/quality/quality_gates.py, core/security/approval.py, core/quality/reviewer.py
Context and memory: core/agents/spawn_prompt.py, core/tokens/context.py, core/knowledge/lessons.py, core/knowledge/knowledge_base.py, core/knowledge/rag.py

API surface (current)

The task server composes router modules from src/bernstein/core/routes/, including:

tasks.py
status.py
agents.py
costs.py
dashboard.py
quality.py
plans.py
graduation.py
webhooks.py
slack.py
auth.py
observability.py

Notable implemented endpoint groups:

Task CRUD, claims, completion/fail, dependencies graph
Agent heartbeats and process/session inspection
Cluster node registration/heartbeat/status/task-steal primitives
Status/events/metrics (including Prometheus-compatible metrics endpoint)
Cost and quality reporting endpoints
Trigger/webhook ingestion routes

Trigger architecture

Trigger orchestration is implemented and centered on:

src/bernstein/core/orchestration/trigger_manager.py
src/bernstein/core/tasks/models.py (TriggerEvent, trigger config models)

Current source adapters:

src/bernstein/core/trigger_sources/github.py
src/bernstein/core/trigger_sources/gitlab.py
src/bernstein/core/trigger_sources/slack.py
src/bernstein/core/trigger_sources/discord.py
src/bernstein/core/trigger_sources/file_watch.py
src/bernstein/core/trigger_sources/webhook.py
src/bernstein/core/trigger_sources/webhook_node.py
src/bernstein/core/trigger_sources/routine.py
src/bernstein/core/trigger_sources/schedule.py

Configuration source:

.sdd/config/triggers.yaml

Boundary: trigger infrastructure is real and usable, but project-specific rule libraries and operational runbooks are still evolving.
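To make the trigger flow concrete, here is a minimal sketch of matching an incoming event against configured rules. Only the name `TriggerEvent` comes from this document; its fields, the `TriggerRule` class, the `match_rules` function, and the rule values are assumptions for illustration and do not reflect the actual schema of `.sdd/config/triggers.yaml`.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass(frozen=True)
class TriggerEvent:
    """Hypothetical event shape; the real model lives in core/tasks/models.py."""
    source: str  # e.g. "github", "file_watch", "schedule"
    kind: str    # e.g. "push", "issue_opened"
    payload: dict = field(default_factory=dict)


@dataclass(frozen=True)
class TriggerRule:
    """Hypothetical rule: glob patterns on source/kind plus a task title template."""
    name: str
    source: str
    kind: str
    task_title: str


def match_rules(event: TriggerEvent, rules: list[TriggerRule]) -> list[dict]:
    """Return one task-creation request per rule whose patterns match the event."""
    tasks = []
    for rule in rules:
        if fnmatch(event.source, rule.source) and fnmatch(event.kind, rule.kind):
            tasks.append({
                "title": rule.task_title.format(**event.payload),
                "trigger": rule.name,
            })
    return tasks


rules = [
    TriggerRule("triage-issue", "github", "issue_*", "Triage issue #{number}"),
    TriggerRule("nightly", "schedule", "*", "Run nightly checks"),
]
event = TriggerEvent("github", "issue_opened", {"number": 42})
tasks = match_rules(event, rules)
```

In this sketch each source adapter would only need to normalize its input into a `TriggerEvent`; rule evaluation stays independent of where the event came from.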

Cluster and remote execution

Implemented pieces:

Worker CLI: src/bernstein/cli/commands/worker_cmd.py
Cluster data model/policy: src/bernstein/core/protocols/cluster/
Cluster API routes in src/bernstein/core/routes/task_cluster.py and src/bernstein/core/routes/tasks.py

Boundary:

Distributed operation is supported as an advanced deployment pattern.
It is not a fully managed autoscaling platform.

Plugins and extensibility

The plugin system is built on pluggy and implemented under:

src/bernstein/plugins/hookspecs.py
src/bernstein/plugins/manager.py

Current hooks include task/agent/evolution lifecycle callbacks.

Boundary:

The hook surface is stable for common extensions.
Advanced plugin packaging and marketplace workflows have few guardrails so far.
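As a rough illustration of how a pluggy-based lifecycle hook fits together, the sketch below defines one hook specification and one plugin implementing it. The project name `bernstein_demo`, the hook name `on_task_completed`, and its parameters are hypothetical and are not copied from `plugins/hookspecs.py`; only the pluggy calls themselves (`HookspecMarker`, `HookimplMarker`, `PluginManager`) are real library API.

```python
import pluggy

# The project name ties hookspecs and hookimpls together; this one is illustrative.
hookspec = pluggy.HookspecMarker("bernstein_demo")
hookimpl = pluggy.HookimplMarker("bernstein_demo")


class DemoHookSpecs:
    """Hypothetical hook specification for a task lifecycle callback."""

    @hookspec
    def on_task_completed(self, task_id, status):
        """Called after a task reaches a terminal status."""


class AuditPlugin:
    """Example plugin that records every completed task it is told about."""

    def __init__(self):
        self.seen = []

    @hookimpl
    def on_task_completed(self, task_id, status):
        self.seen.append((task_id, status))
        return f"audited {task_id}"


pm = pluggy.PluginManager("bernstein_demo")
pm.add_hookspecs(DemoHookSpecs)
plugin = AuditPlugin()
pm.register(plugin)

# pluggy hooks are called with keyword arguments; the result is a list with
# one entry per implementation that returned a non-None value.
results = pm.hook.on_task_completed(task_id="T-1", status="done")
```

The host application calls `pm.hook.<name>(...)` at the relevant lifecycle point and never needs to know which plugins, if any, are registered.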

Observability and telemetry

Implemented:

Status/event streaming routes
Prometheus metrics export
Cost and quality metrics files under .sdd/metrics/
Observability route module for heartbeat/stall insights
OTLP telemetry configuration hooks in core models/bootstrap path

Boundary:

Prometheus and OTLP are real integrations.
Turnkey production dashboards/alert packs are not bundled.
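For readers unfamiliar with what a Prometheus-compatible metrics endpoint returns, the following sketch renders samples in the Prometheus text exposition format using only the standard library. The metric names and the `render_metrics` helper are hypothetical; Bernstein's actual metric names and exporter code are not shown in this document.

```python
def _escape(value):
    """Escape a label value as required by the Prometheus text format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(samples):
    """Render (name, type, help, [(labels, value), ...]) tuples as exposition text."""
    lines = []
    for name, mtype, help_text, series in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in series:
            if labels:
                body = ",".join(
                    f'{key}="{_escape(val)}"' for key, val in sorted(labels.items())
                )
                lines.append(f"{name}{{{body}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names, chosen only for the example.
text = render_metrics([
    ("demo_tasks_total", "counter", "Tasks by final status.",
     [({"status": "closed"}, 7), ({"status": "failed"}, 2)]),
    ("demo_agents_alive", "gauge", "Agents currently alive.", [({}, 3)]),
])
```

A scraper reads this text over HTTP on each scrape interval, so dashboards and alerts can be built on top without changes to the server.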

Evolution and planning

Implemented:

Evolution package (src/bernstein/evolution/)
Plan execution and approval modules (core/planning/planner.py, core/security/plan_approval.py, plan routes)
Retrospective/reporting command path (retro)

Boundary:

End-to-end autonomous self-evolution is implemented with safety controls, but it should run under operator supervision in production.

`.sdd/` state model (current)

Common active paths:

.sdd/backlog/open|claimed|closed/
.sdd/runtime/
.sdd/metrics/
.sdd/traces/
.sdd/memory/
.sdd/caching/
.sdd/agents/

Exact files vary by enabled features and run mode.
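One way a file-first backlog like this can support safe claiming is to move a task file between state directories with an atomic rename. The sketch below assumes one JSON file per task named `<task_id>.json`; that naming, the file contents, and the helper functions are assumptions for illustration, since this document only states that the `open`, `claimed`, and `closed` directories exist.

```python
import json
import os
import tempfile
from pathlib import Path


def init_backlog(root: Path) -> None:
    """Create the three backlog state directories under the given root."""
    for state in ("open", "claimed", "closed"):
        (root / "backlog" / state).mkdir(parents=True, exist_ok=True)


def add_task(root: Path, task_id: str, title: str) -> Path:
    """Write a new task file into backlog/open/ (hypothetical JSON layout)."""
    path = root / "backlog" / "open" / f"{task_id}.json"
    path.write_text(json.dumps({"id": task_id, "title": title}))
    return path


def claim_task(root: Path, task_id: str) -> bool:
    """Move a task from open/ to claimed/; return False if it was already taken."""
    src = root / "backlog" / "open" / f"{task_id}.json"
    dst = root / "backlog" / "claimed" / f"{task_id}.json"
    try:
        os.rename(src, dst)  # atomic on POSIX within a single filesystem
        return True
    except FileNotFoundError:
        return False  # another claimer moved the file first


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".sdd"
    init_backlog(root)
    add_task(root, "T-1", "write docs")
    first = claim_task(root, "T-1")
    second = claim_task(root, "T-1")
    claimed = sorted(p.name for p in (root / "backlog" / "claimed").iterdir())
```

With this layout the directory a file sits in is the task's state, so the backlog can be inspected with ordinary shell tools.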

Lifecycle state machines

All task and agent status changes are governed by a deterministic FSM in src/bernstein/core/tasks/lifecycle.py. Every transition is validated against an explicit table: illegal moves raise IllegalTransitionError, and each applied transition emits a typed LifecycleEvent for audit and replay.

See LIFECYCLE.md for the full state tables, transition metadata, TransitionReason/AbortReason enumerations, and abort-chain hierarchy.

Task FSM (12 states)

stateDiagram-v2
    [*] --> OPEN : dynamic creation
    [*] --> PLANNED : plan mode

    PLANNED --> OPEN : approved
    PLANNED --> CANCELLED : rejected

    OPEN --> CLAIMED : agent claims task
    OPEN --> WAITING_FOR_SUBTASKS : decomposed before claim
    OPEN --> CANCELLED : manual cancel

    CLAIMED --> IN_PROGRESS : agent starts work
    CLAIMED --> OPEN : unclaim / force-reassign
    CLAIMED --> DONE : fast completion
    CLAIMED --> FAILED : immediate failure
    CLAIMED --> CANCELLED : manual cancel
    CLAIMED --> WAITING_FOR_SUBTASKS : agent splits work
    CLAIMED --> BLOCKED : dependency discovered

    IN_PROGRESS --> DONE : agent reports success
    IN_PROGRESS --> FAILED : agent reports failure
    IN_PROGRESS --> BLOCKED : dependency discovered
    IN_PROGRESS --> WAITING_FOR_SUBTASKS : agent decomposes task
    IN_PROGRESS --> OPEN : requeue / force-reassign
    IN_PROGRESS --> CANCELLED : manual cancel
    IN_PROGRESS --> ORPHANED : heartbeat timeout / crash

    ORPHANED --> DONE : partial work merged
    ORPHANED --> FAILED : unrecoverable
    ORPHANED --> OPEN : requeued for retry

    BLOCKED --> OPEN : dependency resolved
    BLOCKED --> CANCELLED : manual cancel

    WAITING_FOR_SUBTASKS --> DONE : all subtasks completed
    WAITING_FOR_SUBTASKS --> BLOCKED : subtask timeout escalation
    WAITING_FOR_SUBTASKS --> CANCELLED : manual cancel

    FAILED --> OPEN : retry (within max_retries)

    DONE --> CLOSED : janitor verified + merged
    DONE --> FAILED : verification rejected

    CLOSED --> [*]
    CANCELLED --> [*]

    %% PENDING_APPROVAL has no FSM-managed transitions - set directly by the approval subsystem.
    PENDING_APPROVAL --> [*]

Note - PENDING_APPROVAL: Set directly by the approval subsystem; has no entry or exit in TASK_TRANSITIONS. See LIFECYCLE.md for details.
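The diagram above can be turned into a table-driven validator. The sketch below transcribes the diagram's edges into a dictionary of allowed targets; the names `TASK_TRANSITIONS` and `IllegalTransitionError` appear in this document, but the data structure, the `transition` function, and the exception body are assumptions and may differ from core/tasks/lifecycle.py. PENDING_APPROVAL is left out because it has no FSM-managed transitions.

```python
# Allowed targets per state, transcribed from the diagram (illustrative layout).
TASK_TRANSITIONS = {
    "PLANNED": {"OPEN", "CANCELLED"},
    "OPEN": {"CLAIMED", "WAITING_FOR_SUBTASKS", "CANCELLED"},
    "CLAIMED": {"IN_PROGRESS", "OPEN", "DONE", "FAILED", "CANCELLED",
                "WAITING_FOR_SUBTASKS", "BLOCKED"},
    "IN_PROGRESS": {"DONE", "FAILED", "BLOCKED", "WAITING_FOR_SUBTASKS",
                    "OPEN", "CANCELLED", "ORPHANED"},
    "ORPHANED": {"DONE", "FAILED", "OPEN"},
    "BLOCKED": {"OPEN", "CANCELLED"},
    "WAITING_FOR_SUBTASKS": {"DONE", "BLOCKED", "CANCELLED"},
    "FAILED": {"OPEN"},
    "DONE": {"CLOSED", "FAILED"},
    "CLOSED": set(),      # terminal
    "CANCELLED": set(),   # terminal
}


class IllegalTransitionError(Exception):
    """Raised when a requested status change is not in the transition table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TASK_TRANSITIONS.get(current, set()):
        raise IllegalTransitionError(f"{current} -> {target}")
    return target


# Happy path: a task is claimed, worked on, verified, and closed.
state = "OPEN"
for target in ("CLAIMED", "IN_PROGRESS", "DONE", "CLOSED"):
    state = transition(state, target)

# A closed task cannot be reopened.
try:
    transition("CLOSED", "OPEN")
    rejected = False
except IllegalTransitionError:
    rejected = True
```

Keeping the table as data means the set of legal moves can be reviewed, tested, and diffed without reading control flow.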

Agent FSM (4 states)

stateDiagram-v2
    [*] --> starting : spawn()

    starting --> working : process confirmed alive
    starting --> dead : spawn failure / fast exit

    working --> idle : task completed, awaiting reuse
    working --> dead : crash / kill / timeout / circuit break

    idle --> working : new task assigned
    idle --> dead : idle recycled (resource reclaim)

    dead --> [*]

Agent Turn FSM (10 states)

Tracks the lifecycle of a single task-handling turn within an agent process. Source: src/bernstein/core/agents/agent_turn_state.py.

stateDiagram-v2
    [*] --> IDLE

    IDLE --> CLAIMING : task_claimed

    CLAIMING --> SPAWNING : agent_spawned
    CLAIMING --> FAILED : task_failed

    SPAWNING --> RUNNING : agent_spawned
    SPAWNING --> FAILED : task_failed

    RUNNING --> TOOL_USE : tool_started
    RUNNING --> COMPACTING : compact_needed
    RUNNING --> VERIFYING : verify_requested
    RUNNING --> FAILED : task_failed

    TOOL_USE --> RUNNING : tool_completed
    TOOL_USE --> FAILED : task_failed

    COMPACTING --> RUNNING : verify_requested
    COMPACTING --> FAILED : task_failed

    VERIFYING --> COMPLETING : task_completed
    VERIFYING --> RUNNING : compact_needed
    VERIFYING --> FAILED : task_failed

    COMPLETING --> REAPED : agent_reaped

    FAILED --> REAPED : agent_reaped

    REAPED --> [*]

See LIFECYCLE.md for the full transition table and events reference.
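Because each edge in the turn diagram is labelled with an event, the FSM can also be expressed as a lookup from (state, event) to the next state, which makes replaying a recorded event sequence straightforward. The mapping below is transcribed from the diagram; the `TURN_TRANSITIONS` name, the `replay` function, and the use of `ValueError` are assumptions and are not taken from agent_turn_state.py.

```python
# (current state, event) -> next state, transcribed from the diagram.
TURN_TRANSITIONS = {
    ("IDLE", "task_claimed"): "CLAIMING",
    ("CLAIMING", "agent_spawned"): "SPAWNING",
    ("CLAIMING", "task_failed"): "FAILED",
    ("SPAWNING", "agent_spawned"): "RUNNING",
    ("SPAWNING", "task_failed"): "FAILED",
    ("RUNNING", "tool_started"): "TOOL_USE",
    ("RUNNING", "compact_needed"): "COMPACTING",
    ("RUNNING", "verify_requested"): "VERIFYING",
    ("RUNNING", "task_failed"): "FAILED",
    ("TOOL_USE", "tool_completed"): "RUNNING",
    ("TOOL_USE", "task_failed"): "FAILED",
    ("COMPACTING", "verify_requested"): "RUNNING",
    ("COMPACTING", "task_failed"): "FAILED",
    ("VERIFYING", "task_completed"): "COMPLETING",
    ("VERIFYING", "compact_needed"): "RUNNING",
    ("VERIFYING", "task_failed"): "FAILED",
    ("COMPLETING", "agent_reaped"): "REAPED",
    ("FAILED", "agent_reaped"): "REAPED",
}


def replay(events, state="IDLE"):
    """Apply events in order and return every state visited, including the start."""
    history = [state]
    for event in events:
        key = (state, event)
        if key not in TURN_TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in state {state}")
        state = TURN_TRANSITIONS[key]
        history.append(state)
    return history


# One successful turn with a single tool call before verification.
history = replay([
    "task_claimed", "agent_spawned", "agent_spawned",
    "tool_started", "tool_completed",
    "verify_requested", "task_completed", "agent_reaped",
])
```

Replaying a stored event list through the same table reproduces the state history of a turn, which is useful when auditing why a turn ended in FAILED or REAPED.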

Non-goals for this document

This file is not a roadmap backlog.
This file is not a generated protocol matrix.
This file is not a per-command CLI reference (see getting-started/install.md and bernstein --help).
