Bernstein Design
This document describes the current architecture of Bernstein as implemented in the codebase today, with explicit boundaries for partial features.
Core design principles
- Short-lived workers: agents are spawned for focused work and then exit.
- File-first state: runtime state is persisted under .sdd/.
- Deterministic orchestration: scheduling and lifecycle decisions are code-driven.
- Verification before closure: task completion passes through janitor/quality logic.
- Multi-adapter runtime: Bernstein is CLI-agent agnostic via adapter interfaces.
High-level architecture
CLI (src/bernstein/cli/)
-> Task server (src/bernstein/core/server.py shim -> core/server/)
-> Route modules (src/bernstein/core/routes/)
-> Store + lifecycle + orchestration (core/ sub-packages)
-> Adapter-based process spawning (adapters/)
Since v1.6, core/ is organized into 22 sub-packages. Top-level modules like core/server.py, core/orchestrator.py, core/spawner.py, core/task_lifecycle.py, and core/models.py are thin re-export shims that redirect to their sub-packages.
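A thin re-export shim of the kind described above can be sketched as follows. The package name `demo_pkg` and the `Orchestrator` class are hypothetical stand-ins, not Bernstein's actual modules; the sketch builds a throwaway package on disk only to show that the legacy import path and the sub-package path resolve to the same object.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout (not Bernstein's real files):
#   demo_pkg/orchestrator.py                 <- thin shim
#   demo_pkg/orchestration/orchestrator.py   <- real implementation
FILES = {
    "demo_pkg/__init__.py": "",
    "demo_pkg/orchestration/__init__.py": "",
    "demo_pkg/orchestration/orchestrator.py": (
        "class Orchestrator:\n"
        "    def tick(self):\n"
        "        return 'tick'\n"
    ),
    # The shim only re-exports; it holds no logic of its own.
    "demo_pkg/orchestrator.py": (
        "from demo_pkg.orchestration.orchestrator import Orchestrator\n"
        "__all__ = ['Orchestrator']\n"
    ),
}


def build_demo_package(root: Path) -> None:
    """Write the hypothetical package tree under root."""
    for rel_path, source in FILES.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source)


root = Path(tempfile.mkdtemp())
build_demo_package(root)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

legacy = importlib.import_module("demo_pkg.orchestrator")
real = importlib.import_module("demo_pkg.orchestration.orchestrator")

# Both import paths resolve to the very same class object.
same_object = legacy.Orchestrator is real.Orchestrator
```

Because the shim holds no state, callers on the old path and callers on the new path share one implementation, which is what makes this kind of reorganization backward compatible.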
Primary orchestration modules:
- src/bernstein/core/orchestrator.py (shim) -> src/bernstein/core/orchestration/orchestrator.py
- src/bernstein/core/orchestration/tick_pipeline.py
- src/bernstein/core/tasks/task_lifecycle.py
- src/bernstein/core/agents/agent_lifecycle.py
Key runtime subsystems (in sub-packages):
- Routing/cost: core/routing/router.py, core/routing/cascade_router.py, core/cost/cost.py, core/cost/cost_history.py, core/cost/cost_anomaly.py
- Reliability: core/agents/heartbeat.py, core/cost/completion_budget.py, core/observability/loop_detector.py
- Verification: core/quality/janitor.py, core/quality/quality_gates.py, core/security/approval.py, core/quality/reviewer.py
- Context and memory: core/agents/spawn_prompt.py, core/tokens/context.py, core/knowledge/lessons.py, core/knowledge/knowledge_base.py, core/knowledge/rag.py
API surface (current)
The task server composes router modules from src/bernstein/core/routes/, including:
- tasks.py
- status.py
- agents.py
- costs.py
- dashboard.py
- quality.py
- plans.py
- graduation.py
- webhooks.py
- slack.py
- auth.py
- observability.py
Notable implemented endpoint groups:
- Task CRUD, claims, completion/fail, dependencies graph
- Agent heartbeats and process/session inspection
- Cluster node registration/heartbeat/status/task-steal primitives
- Status/events/metrics (including Prometheus-compatible metrics endpoint)
- Cost and quality reporting endpoints
- Trigger/webhook ingestion routes
Trigger architecture
Trigger orchestration is implemented and centered on:
- src/bernstein/core/orchestration/trigger_manager.py
- src/bernstein/core/tasks/models.py (TriggerEvent, trigger config models)
Current source adapters:
- src/bernstein/core/trigger_sources/github.py
- src/bernstein/core/trigger_sources/gitlab.py
- src/bernstein/core/trigger_sources/slack.py
- src/bernstein/core/trigger_sources/discord.py
- src/bernstein/core/trigger_sources/file_watch.py
- src/bernstein/core/trigger_sources/webhook.py
Configuration source:
.sdd/config/triggers.yaml
Boundary: trigger infrastructure is real and usable, but project-specific rule libraries and operational runbooks are still evolving.
Cluster and remote execution
Implemented pieces:
- Worker CLI: src/bernstein/cli/commands/worker_cmd.py
- Cluster data model/policy: src/bernstein/core/protocols/cluster.py
- Cluster API routes in src/bernstein/core/routes/task_cluster.py and src/bernstein/core/routes/tasks.py
Boundary:
- Distributed operation works as an advanced deployment pattern.
- It is not presented as a fully managed autoscaling platform.
Plugins and extensibility
Plugin system is pluggy-based and implemented under:
- src/bernstein/plugins/hookspecs.py
- src/bernstein/plugins/manager.py
Current hooks include task/agent/evolution lifecycle callbacks.
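For readers unfamiliar with pluggy, the sketch below shows the general hookspec/hookimpl pattern that such a plugin system builds on. It requires the third-party `pluggy` package. The project name `demo`, the hook name `on_task_completed`, and its arguments are hypothetical; the real hook names and signatures are defined in hookspecs.py and may differ.

```python
import pluggy

# "demo" and the hook below are hypothetical, not Bernstein's real hookspecs.
hookspec = pluggy.HookspecMarker("demo")
hookimpl = pluggy.HookimplMarker("demo")


class TaskHookSpecs:
    """Declares which hooks exist and which arguments they receive."""

    @hookspec
    def on_task_completed(self, task_id, status):
        """Called after a task reaches a final status."""


class AuditPlugin:
    """A plugin that implements the hook and records what it saw."""

    def __init__(self):
        self.seen = []

    @hookimpl
    def on_task_completed(self, task_id, status):
        self.seen.append((task_id, status))
        return f"audited {task_id}"


manager = pluggy.PluginManager("demo")
manager.add_hookspecs(TaskHookSpecs)

plugin = AuditPlugin()
manager.register(plugin)

# Hooks are called with keyword arguments; pluggy collects the
# non-None return values of all registered implementations in a list.
results = manager.hook.on_task_completed(task_id="T-1", status="done")
```

The host application owns the specs and the manager; extensions only need the `hookimpl` marker and a matching method name.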
Boundary:
- Hook surface is stable for common extensions.
- Advanced plugin packaging/marketplace workflows are still light on guardrails.
Observability and telemetry
Implemented:
- Status/event streaming routes
- Prometheus metrics export
- Cost and quality metrics files under .sdd/metrics/
- Observability route module for heartbeat/stall insights
- OTLP telemetry configuration hooks in core models/bootstrap path
Boundary:
- Prometheus and OTLP are real integrations.
- Turnkey production dashboards/alert packs are not bundled.
Evolution and planning
Implemented:
- Evolution package (src/bernstein/evolution/)
- Plan execution and approval modules (core/planning/planner.py, core/security/plan_approval.py, plan routes)
- Retrospective/reporting command path (retro)
Boundary:
- End-to-end autonomous self-evolution exists with safety controls, but should be treated as operator-supervised in production settings.
.sdd/ state model (current)
Common active paths:
- .sdd/backlog/open|claimed|closed/
- .sdd/runtime/
- .sdd/metrics/
- .sdd/traces/
- .sdd/memory/
- .sdd/caching/
- .sdd/agents/
Exact files vary by enabled features and run mode.
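To illustrate the file-first model, the sketch below moves a task between the open, claimed, and closed backlog directories. The one-JSON-file-per-task layout, the `<task_id>.json` naming, and all helper names are assumptions made for illustration; the actual on-disk format may differ.

```python
import json
import os
import tempfile
from pathlib import Path

STATES = ("open", "claimed", "closed")  # directory names from the paths above


def init_backlog(root: Path) -> Path:
    """Create .sdd/backlog/<state>/ directories under root."""
    backlog = root / ".sdd" / "backlog"
    for state in STATES:
        (backlog / state).mkdir(parents=True, exist_ok=True)
    return backlog


def create_task(backlog: Path, task_id: str, title: str) -> Path:
    """Persist a new task as a JSON file in the open directory (assumed format)."""
    path = backlog / "open" / f"{task_id}.json"
    path.write_text(json.dumps({"id": task_id, "title": title}))
    return path


def move_task(backlog: Path, task_id: str, src: str, dst: str) -> Path:
    """Move a task file between state directories.

    os.replace is atomic when both paths are on the same filesystem, and it
    raises FileNotFoundError if the file is no longer in the source directory.
    """
    source = backlog / src / f"{task_id}.json"
    target = backlog / dst / f"{task_id}.json"
    os.replace(source, target)
    return target


def list_tasks(backlog: Path, state: str) -> list[str]:
    """Return the sorted task ids currently stored under one state directory."""
    return sorted(p.stem for p in (backlog / state).glob("*.json"))


backlog = init_backlog(Path(tempfile.mkdtemp()))
create_task(backlog, "T-1", "write docs")
move_task(backlog, "T-1", "open", "claimed")
claimed_now = list_tasks(backlog, "claimed")
```

With this layout the directory a file lives in is the task's state, so the backlog can be inspected with ordinary shell tools and survives a process restart without a separate database.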
Lifecycle state machines
All task and agent status changes are governed by a deterministic FSM in src/bernstein/core/tasks/lifecycle.py. Every transition is validated against an explicit table; illegal moves raise IllegalTransitionError and emit a typed LifecycleEvent for audit and replay.
See LIFECYCLE.md for the full state tables, transition metadata, TransitionReason/AbortReason enumerations, and abort-chain hierarchy.
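A table-validated transition function of this kind can be sketched as below. This is a minimal illustration, not the code in lifecycle.py: it covers only a subset of the task states and edges documented in this section, and the `transition` helper, the `LifecycleEvent` fields, and the choice to record an event for each accepted transition are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class TaskStatus(Enum):
    OPEN = "open"
    CLAIMED = "claimed"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"
    CLOSED = "closed"


class IllegalTransitionError(Exception):
    """Raised when a requested move is not in the transition table."""


@dataclass(frozen=True)
class LifecycleEvent:
    # Field names are assumed for illustration.
    task_id: str
    old: TaskStatus
    new: TaskStatus
    reason: str


S = TaskStatus
# Subset of the documented task transitions, as (from, to) pairs.
TASK_TRANSITIONS = {
    (S.OPEN, S.CLAIMED),
    (S.CLAIMED, S.IN_PROGRESS),
    (S.CLAIMED, S.OPEN),
    (S.IN_PROGRESS, S.DONE),
    (S.IN_PROGRESS, S.FAILED),
    (S.FAILED, S.OPEN),
    (S.DONE, S.CLOSED),
    (S.DONE, S.FAILED),
}


def transition(task_id, old, new, reason, audit_log):
    """Validate a move against the table, then record a typed event."""
    if (old, new) not in TASK_TRANSITIONS:
        raise IllegalTransitionError(f"{old.name} -> {new.name} is not allowed")
    event = LifecycleEvent(task_id, old, new, reason)
    audit_log.append(event)
    return event


audit_log = []
transition("T-1", S.OPEN, S.CLAIMED, "agent claims task", audit_log)
transition("T-1", S.CLAIMED, S.IN_PROGRESS, "agent starts work", audit_log)

try:
    # Skipping DONE is not in the table, so this move is rejected.
    transition("T-1", S.IN_PROGRESS, S.CLOSED, "skip verification", audit_log)
    rejected = False
except IllegalTransitionError:
    rejected = True
```

Keeping the table as plain data means the set of legal moves can be reviewed and tested without reading any control flow, and the recorded events are enough to replay a task's history.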
Task FSM (12 states)
stateDiagram-v2
[*] --> OPEN : dynamic creation
[*] --> PLANNED : plan mode
PLANNED --> OPEN : approved
PLANNED --> CANCELLED : rejected
OPEN --> CLAIMED : agent claims task
OPEN --> WAITING_FOR_SUBTASKS : decomposed before claim
OPEN --> CANCELLED : manual cancel
CLAIMED --> IN_PROGRESS : agent starts work
CLAIMED --> OPEN : unclaim / force-reassign
CLAIMED --> DONE : fast completion
CLAIMED --> FAILED : immediate failure
CLAIMED --> CANCELLED : manual cancel
CLAIMED --> WAITING_FOR_SUBTASKS : agent splits work
CLAIMED --> BLOCKED : dependency discovered
IN_PROGRESS --> DONE : agent reports success
IN_PROGRESS --> FAILED : agent reports failure
IN_PROGRESS --> BLOCKED : dependency discovered
IN_PROGRESS --> WAITING_FOR_SUBTASKS : agent decomposes task
IN_PROGRESS --> OPEN : requeue / force-reassign
IN_PROGRESS --> CANCELLED : manual cancel
IN_PROGRESS --> ORPHANED : heartbeat timeout / crash
ORPHANED --> DONE : partial work merged
ORPHANED --> FAILED : unrecoverable
ORPHANED --> OPEN : requeued for retry
BLOCKED --> OPEN : dependency resolved
BLOCKED --> CANCELLED : manual cancel
WAITING_FOR_SUBTASKS --> DONE : all subtasks completed
WAITING_FOR_SUBTASKS --> BLOCKED : subtask timeout escalation
WAITING_FOR_SUBTASKS --> CANCELLED : manual cancel
FAILED --> OPEN : retry (within max_retries)
DONE --> CLOSED : janitor verified + merged
DONE --> FAILED : verification rejected
CLOSED --> [*]
CANCELLED --> [*]
%% PENDING_APPROVAL has no FSM-managed transitions — set directly by the approval subsystem.
PENDING_APPROVAL --> [*]
Note: PENDING_APPROVAL is set directly by the approval subsystem and has no entry or exit in TASK_TRANSITIONS. See LIFECYCLE.md for details.
Agent FSM (4 states)
stateDiagram-v2
[*] --> starting : spawn()
starting --> working : process confirmed alive
starting --> dead : spawn failure / fast exit
working --> idle : task completed, awaiting reuse
working --> dead : crash / kill / timeout / circuit break
idle --> working : new task assigned
idle --> dead : idle recycled (resource reclaim)
dead --> [*]
Agent Turn FSM (10 states)
Tracks the lifecycle of a single task-handling turn within an agent process. Source: src/bernstein/core/agents/agent_turn_state.py.
stateDiagram-v2
[*] --> IDLE
IDLE --> CLAIMING : task_claimed
CLAIMING --> SPAWNING : agent_spawned
CLAIMING --> FAILED : task_failed
SPAWNING --> RUNNING : agent_spawned
SPAWNING --> FAILED : task_failed
RUNNING --> TOOL_USE : tool_started
RUNNING --> COMPACTING : compact_needed
RUNNING --> VERIFYING : verify_requested
RUNNING --> FAILED : task_failed
TOOL_USE --> RUNNING : tool_completed
TOOL_USE --> FAILED : task_failed
COMPACTING --> RUNNING : verify_requested
COMPACTING --> FAILED : task_failed
VERIFYING --> COMPLETING : task_completed
VERIFYING --> RUNNING : compact_needed
VERIFYING --> FAILED : task_failed
COMPLETING --> REAPED : agent_reaped
FAILED --> REAPED : agent_reaped
REAPED --> [*]
See LIFECYCLE.md for the full transition table and events reference.
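The Agent Turn diagram above is event-driven, so it maps naturally onto a lookup keyed by (state, event). The sketch below transcribes the edges exactly as listed in the diagram; the `TurnState` enum, the `advance` and `replay` helpers, and the use of `ValueError` for unknown moves are illustrative assumptions rather than the contents of agent_turn_state.py.

```python
from enum import Enum, auto


class TurnState(Enum):
    IDLE = auto()
    CLAIMING = auto()
    SPAWNING = auto()
    RUNNING = auto()
    TOOL_USE = auto()
    COMPACTING = auto()
    VERIFYING = auto()
    COMPLETING = auto()
    FAILED = auto()
    REAPED = auto()


T = TurnState
# (current state, event) -> next state, transcribed from the diagram.
TURN_TRANSITIONS = {
    (T.IDLE, "task_claimed"): T.CLAIMING,
    (T.CLAIMING, "agent_spawned"): T.SPAWNING,
    (T.CLAIMING, "task_failed"): T.FAILED,
    (T.SPAWNING, "agent_spawned"): T.RUNNING,
    (T.SPAWNING, "task_failed"): T.FAILED,
    (T.RUNNING, "tool_started"): T.TOOL_USE,
    (T.RUNNING, "compact_needed"): T.COMPACTING,
    (T.RUNNING, "verify_requested"): T.VERIFYING,
    (T.RUNNING, "task_failed"): T.FAILED,
    (T.TOOL_USE, "tool_completed"): T.RUNNING,
    (T.TOOL_USE, "task_failed"): T.FAILED,
    (T.COMPACTING, "verify_requested"): T.RUNNING,
    (T.COMPACTING, "task_failed"): T.FAILED,
    (T.VERIFYING, "task_completed"): T.COMPLETING,
    (T.VERIFYING, "compact_needed"): T.RUNNING,
    (T.VERIFYING, "task_failed"): T.FAILED,
    (T.COMPLETING, "agent_reaped"): T.REAPED,
    (T.FAILED, "agent_reaped"): T.REAPED,
}


def advance(state, event):
    """Return the next state, or raise if the event is not valid here."""
    try:
        return TURN_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in {state.name}") from None


def replay(events, start=TurnState.IDLE):
    """Fold a sequence of events into a final state."""
    state = start
    for event in events:
        state = advance(state, event)
    return state


# A successful turn that uses one tool before verification.
final = replay([
    "task_claimed", "agent_spawned", "agent_spawned",
    "tool_started", "tool_completed",
    "verify_requested", "task_completed", "agent_reaped",
])
```

Since each (state, event) pair appears at most once in the diagram, the lookup is unambiguous, and a recorded event stream can be replayed to reconstruct where a turn ended.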
Non-goals for this document
- This file is not a roadmap backlog.
- This file is not a generated protocol matrix.
- This file is not a per-command CLI reference (see GETTING_STARTED.md and bernstein --help).