Lifecycle State Machines
Bernstein uses deterministic finite state machines (FSMs) for both task and agent lifecycle management. All transitions flow through the Lifecycle Governance Kernel (core/tasks/lifecycle.py), which validates transitions against an explicit transition table, rejects illegal moves with IllegalTransitionError, and emits typed LifecycleEvent records for audit, replay, and metrics.
Source of truth: src/bernstein/core/tasks/lifecycle.py (transition tables), src/bernstein/core/tasks/models.py (TaskStatus enum, AgentSession dataclass).
Task States (12 states)
| Status | Description |
|---|---|
| PLANNED | Awaiting human approval before execution (plan mode). Tasks created from plan YAML files start here. |
| OPEN | Ready for an agent to claim. The default starting state for dynamically created tasks. |
| CLAIMED | An agent has claimed this task but has not yet started work. |
| IN_PROGRESS | Agent is actively working on the task. |
| DONE | Agent reported completion. Pending janitor verification and merge. |
| CLOSED | Verified and merged. Terminal state. |
| FAILED | Agent reported failure or verification rejected the result. Can be retried. |
| BLOCKED | Waiting on an external dependency (another task, resource, or approval). |
| WAITING_FOR_SUBTASKS | Parent task waiting for child subtasks to complete (agent decomposed work). |
| CANCELLED | Manually or programmatically cancelled. Terminal state. |
| ORPHANED | Agent crashed mid-task; pending crash recovery by the orchestrator. |
| PENDING_APPROVAL | Task completed but requires human approval before taking effect. |
Task State Diagram
stateDiagram-v2
[*] --> PLANNED : plan mode
[*] --> OPEN : dynamic creation
PLANNED --> OPEN : approved
PLANNED --> CANCELLED : rejected
OPEN --> CLAIMED : agent claims task
OPEN --> WAITING_FOR_SUBTASKS : decomposed before claim
OPEN --> CANCELLED : manual cancel
CLAIMED --> IN_PROGRESS : agent starts work
CLAIMED --> OPEN : unclaim / force-reassign
CLAIMED --> DONE : fast completion (trivial task)
CLAIMED --> FAILED : immediate failure
CLAIMED --> CANCELLED : manual cancel
CLAIMED --> WAITING_FOR_SUBTASKS : agent splits work
CLAIMED --> BLOCKED : dependency discovered
IN_PROGRESS --> DONE : agent reports success
IN_PROGRESS --> FAILED : agent reports failure
IN_PROGRESS --> BLOCKED : dependency discovered
IN_PROGRESS --> WAITING_FOR_SUBTASKS : agent decomposes task
IN_PROGRESS --> OPEN : requeue (force-reassign)
IN_PROGRESS --> CANCELLED : manual cancel
IN_PROGRESS --> ORPHANED : agent crash detected
ORPHANED --> DONE : partial work merged successfully
ORPHANED --> FAILED : unrecoverable
ORPHANED --> OPEN : requeued for retry
BLOCKED --> OPEN : dependency resolved
BLOCKED --> CANCELLED : manual cancel
WAITING_FOR_SUBTASKS --> DONE : all subtasks completed
WAITING_FOR_SUBTASKS --> BLOCKED : subtask timeout escalation
WAITING_FOR_SUBTASKS --> CANCELLED : manual cancel
FAILED --> OPEN : retry (within max_retries)
DONE --> CLOSED : janitor verified + merged
DONE --> FAILED : verification rejected
CLOSED --> [*]
CANCELLED --> [*]
%% PENDING_APPROVAL is a terminal state set directly by the approval
%% subsystem. It has no FSM-managed inbound or outbound transitions.
PENDING_APPROVAL --> [*]
Note (PENDING_APPROVAL): This state exists in the TaskStatus enum and is used by the approval subsystem (see src/bernstein/core/security/approval.py). It is set directly rather than through the TASK_TRANSITIONS table, so it has no FSM-managed entry or exit path. Tasks in this state await human review and cannot progress further without manual intervention.
Task Transition Table (exhaustive)
Every allowed transition is listed below. The guard function for all transitions is _always (unconditional). Any transition not in this table raises IllegalTransitionError.
| From | To | Trigger |
|---|---|---|
| PLANNED | OPEN | Human approves the planned task |
| PLANNED | CANCELLED | Human rejects the planned task |
| OPEN | CLAIMED | Agent calls claim_next() or claim_by_id() |
| OPEN | WAITING_FOR_SUBTASKS | Task decomposed before agent assignment |
| OPEN | CANCELLED | Manual cancellation |
| CLAIMED | IN_PROGRESS | Agent begins execution |
| CLAIMED | OPEN | Unclaim / force-reassign to different agent |
| CLAIMED | DONE | Fast completion (task was trivial) |
| CLAIMED | FAILED | Immediate failure (e.g., scope violation) |
| CLAIMED | CANCELLED | Manual cancellation |
| CLAIMED | WAITING_FOR_SUBTASKS | Agent splits task into subtasks |
| CLAIMED | BLOCKED | Dependency discovered after claim |
| IN_PROGRESS | DONE | Agent reports successful completion |
| IN_PROGRESS | FAILED | Agent reports failure |
| IN_PROGRESS | BLOCKED | External dependency blocks progress |
| IN_PROGRESS | WAITING_FOR_SUBTASKS | Agent decomposes task mid-execution |
| IN_PROGRESS | OPEN | Force-requeue for different agent |
| IN_PROGRESS | CANCELLED | Manual cancellation |
| IN_PROGRESS | ORPHANED | Heartbeat timeout / agent crash detected |
| ORPHANED | DONE | Partial work saved and merged |
| ORPHANED | FAILED | Crash recovery failed |
| ORPHANED | OPEN | Requeued for retry by another agent |
| BLOCKED | OPEN | Blocking dependency resolved |
| BLOCKED | CANCELLED | Manual cancellation |
| WAITING_FOR_SUBTASKS | DONE | All child subtasks completed |
| WAITING_FOR_SUBTASKS | BLOCKED | Subtask timeout escalation (parent blocked waiting on unresponsive subtask) |
| WAITING_FOR_SUBTASKS | CANCELLED | Manual cancellation |
| FAILED | OPEN | Retry (respects max_retries, default 3) |
| DONE | CLOSED | Janitor verification passed + branch merged |
| DONE | FAILED | Janitor verification rejected the result |
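The table-driven check described above can be sketched as follows. This is a minimal illustration, not the code in lifecycle.py: the name `apply_transition`, the `ALLOWED` set, and the trimmed-down enum are assumptions made for the example, and only a subset of the rows is included.

```python
from enum import Enum, auto


class TaskStatus(Enum):
    # Only the states needed for this illustration; the real enum has 12.
    OPEN = auto()
    CLAIMED = auto()
    IN_PROGRESS = auto()
    DONE = auto()
    FAILED = auto()
    CLOSED = auto()


class IllegalTransitionError(Exception):
    """Raised for any (from, to) pair missing from the table."""


# Illustrative subset of the rows above, stored as (from, to) pairs.
ALLOWED = frozenset({
    (TaskStatus.OPEN, TaskStatus.CLAIMED),
    (TaskStatus.CLAIMED, TaskStatus.IN_PROGRESS),
    (TaskStatus.CLAIMED, TaskStatus.OPEN),
    (TaskStatus.IN_PROGRESS, TaskStatus.DONE),
    (TaskStatus.IN_PROGRESS, TaskStatus.FAILED),
    (TaskStatus.FAILED, TaskStatus.OPEN),
    (TaskStatus.DONE, TaskStatus.CLOSED),
    (TaskStatus.DONE, TaskStatus.FAILED),
})


def apply_transition(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return the new status, or raise if the move is not in the table."""
    if (current, target) not in ALLOWED:
        raise IllegalTransitionError(f"{current.name} -> {target.name}")
    return target
```

Because the table is a set of pairs, a lookup is a single membership test, and anything not listed (for example CLOSED to OPEN) is rejected by default.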
Terminal States
Terminal states have no outbound transitions. They are computed by the lifecycle kernel:
- CLOSED
- CANCELLED
- PENDING_APPROVAL (awaits external action; no programmatic exit)
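One way to derive the terminal set is to look for states that never appear as a source in the transition table. The sketch below transcribes the documented transitions into a plain dictionary; the names `OUTBOUND`, `ALL_STATES`, and `terminal_states` are hypothetical, and the kernel's real data structures may differ.

```python
# Outbound transitions per state, transcribed from the documented table.
OUTBOUND = {
    "PLANNED": {"OPEN", "CANCELLED"},
    "OPEN": {"CLAIMED", "WAITING_FOR_SUBTASKS", "CANCELLED"},
    "CLAIMED": {"IN_PROGRESS", "OPEN", "DONE", "FAILED", "CANCELLED",
                "WAITING_FOR_SUBTASKS", "BLOCKED"},
    "IN_PROGRESS": {"DONE", "FAILED", "BLOCKED", "WAITING_FOR_SUBTASKS",
                    "OPEN", "CANCELLED", "ORPHANED"},
    "ORPHANED": {"DONE", "FAILED", "OPEN"},
    "BLOCKED": {"OPEN", "CANCELLED"},
    "WAITING_FOR_SUBTASKS": {"DONE", "BLOCKED", "CANCELLED"},
    "FAILED": {"OPEN"},
    "DONE": {"CLOSED", "FAILED"},
}

# Every status in the enum, including ones the table never mentions.
ALL_STATES = set(OUTBOUND) | {"CLOSED", "CANCELLED", "PENDING_APPROVAL"}


def terminal_states(all_states, outbound):
    """A state is terminal when it has no outbound transitions."""
    return {s for s in all_states if not outbound.get(s)}


print(sorted(terminal_states(ALL_STATES, OUTBOUND)))
# ['CANCELLED', 'CLOSED', 'PENDING_APPROVAL']
```

Iterating over every status rather than only the table's keys is what makes PENDING_APPROVAL show up, since it has no row in the table at all.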
Adaptive Timeout
Task timeouts are not static. The adaptive timeout system (src/bernstein/core/orchestration/adaptive_timeout.py) adjusts wall-clock timeouts based on historical task durations. Default scope-based timeouts are defined in src/bernstein/core/defaults.py (TASK.scope_timeout_s): small=15 min, medium=30 min, large=60 min, XL=120 min.
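The adjustment rule itself is not described here, so the following is only a sketch of how a history-based timeout could work. The scope defaults come from the text; everything else is an assumption made for illustration: the function name, the dictionary keys, the "twice the median" rule, the minimum sample count, and the clamping bounds.

```python
from statistics import median

# Scope defaults from the text, in seconds. The key spelling is assumed.
SCOPE_TIMEOUT_S = {
    "small": 15 * 60,
    "medium": 30 * 60,
    "large": 60 * 60,
    "xl": 120 * 60,
}


def adaptive_timeout_s(scope, history_s, multiplier=2.0, min_samples=5):
    """Hypothetical rule: multiplier x median of past durations, clamped.

    Falls back to the scope default until enough history exists, and
    never strays outside [0.5 x default, 2 x default].
    """
    default = SCOPE_TIMEOUT_S[scope]
    if len(history_s) < min_samples:
        return default
    estimate = multiplier * median(history_s)
    return int(min(max(estimate, 0.5 * default), 2 * default))


print(adaptive_timeout_s("small", []))           # 900
print(adaptive_timeout_s("small", [600] * 5))    # 1200
print(adaptive_timeout_s("medium", [100] * 5))   # 900
```

Clamping around the default keeps a handful of unusually fast or slow runs from producing a timeout that is either too tight to be useful or effectively unbounded.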
Graduated Access Control
The graduated access control system (src/bernstein/core/security/graduated_access.py) gates which lifecycle transitions an agent is permitted to perform based on its trust level and track record. New agents start with restricted permissions that expand as they demonstrate reliability.
Agent States (4 states)
| Status | Description |
|---|---|
| starting | Agent process has been spawned but has not yet confirmed readiness. |
| working | Agent is actively executing a task. |
| idle | Agent finished its current task and is available for new work. |
| dead | Agent process has exited (success, crash, kill, timeout, or recycled). Terminal state. |
Agent State Diagram
stateDiagram-v2
[*] --> starting : spawn()
starting --> working : process confirmed alive
starting --> dead : spawn failure / fast exit
working --> idle : task completed, agent awaiting reuse
working --> dead : crash / kill / timeout / circuit break
idle --> working : new task assigned
idle --> dead : idle recycled (resource reclaim)
dead --> [*]
Agent Transition Table (exhaustive)
| From | To | Trigger |
|---|---|---|
| starting | working | Process started successfully, heartbeat received |
| starting | dead | SpawnError, RateLimitError, or fast exit detection |
| working | idle | Agent finished current task, session still alive |
| working | dead | Process crash (SIGKILL/OOM), manual kill, timeout watchdog, or circuit breaker |
| idle | working | Orchestrator assigns a new task to the existing session |
| idle | dead | Idle recycling (orchestrator reclaims resources from idle agents) |
Transition Metadata
Every transition produces a LifecycleEvent with:
- timestamp (Unix epoch)
- entity_type ("task" or "agent")
- entity_id (task ID or session ID)
- from_status / to_status
- actor (who triggered it: "task_store", "spawner", "janitor", "plan_approval", etc.)
- reason (human-readable explanation)
- transition_reason (canonical TransitionReason enum, when applicable)
- abort_reason (canonical AbortReason enum, for abnormal agent termination)
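As a rough picture of the record, the fields listed above map naturally onto a frozen dataclass. The field names follow the list; the types, defaults, and the use of plain strings instead of the TransitionReason and AbortReason enums are assumptions made to keep the sketch short, and the task ID is made up.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class LifecycleEvent:
    entity_type: str                         # "task" or "agent"
    entity_id: str                           # task ID or session ID
    from_status: str
    to_status: str
    actor: str                               # e.g. "janitor", "spawner"
    reason: str                              # human-readable explanation
    transition_reason: Optional[str] = None  # enum in the real model
    abort_reason: Optional[str] = None       # enum in the real model
    timestamp: float = field(default_factory=time.time)  # Unix epoch


event = LifecycleEvent(
    entity_type="task",
    entity_id="task-123",  # hypothetical ID
    from_status="done",
    to_status="closed",
    actor="janitor",
    reason="verification passed and branch merged",
    transition_reason="completed",
)

# Frozen records serialize cleanly for an audit log or replay stream.
line = json.dumps(asdict(event))
```

Keeping the record immutable suits the audit and replay use cases: once emitted, an event cannot be altered by later code.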
TransitionReason Values
These canonical reasons classify why a lifecycle transition occurred:
| Value | Meaning |
|---|---|
| completed | Normal successful completion |
| aborted | Explicit abort requested |
| retry | Task being retried after failure |
| prompt_too_long | Input exceeded model context window |
| max_output_tokens | Model hit output token limit |
| max_turns | Agent reached max conversation turns |
| provider_413 | Provider returned 413 (payload too large) |
| provider_529 | Provider returned 529 (overloaded) |
| compaction_failed | Context compaction/summarization failed |
| stop_hook_blocked | A stop hook prevented the transition |
| permission_denied | Insufficient permissions for the operation |
| sibling_aborted | A sibling agent in the same group was aborted |
| orphan_recovered | Orphaned task was automatically recovered |
AbortReason Values
These classify abnormal agent terminations:
| Value | Meaning |
|---|---|
| user_interrupt | SIGINT (Ctrl+C) |
| shutdown_signal | SIGTERM (graceful shutdown) |
| timeout | Watchdog timer expired (exit code 124) |
| oom | Out of memory (exit code 137 / SIGKILL) |
| permission_denied | Exit code 126 |
| provider_error | API provider returned an unrecoverable error |
| bash_error | A bash tool invocation caused a fatal error |
| sibling_aborted | Cascading abort from sibling agent failure |
| parent_aborted | Cascading abort from parent session |
| compact_failure | Context window compaction failed |
| unknown | Unclassified termination |
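The exit-code rows in this table suggest a simple classifier. The sketch below is an assumption about how such a mapping might look, not the project's implementation: the function name is hypothetical, and it additionally accepts the two common encodings of death by signal (a negative signal number, as Python's subprocess module reports on POSIX, and the shell convention of 128 plus the signal number).

```python
# POSIX signal numbers, written as literals so the sketch runs anywhere.
SIGINT, SIGKILL, SIGTERM = 2, 9, 15


def classify_exit(returncode: int) -> str:
    """Map a process return code to an AbortReason value (as a string)."""
    # Normalize "killed by signal N" to the shell-style 128 + N form.
    code = 128 - returncode if returncode < 0 else returncode
    if code == 128 + SIGINT:    # 130
        return "user_interrupt"
    if code == 128 + SIGTERM:   # 143
        return "shutdown_signal"
    if code == 124:
        return "timeout"
    if code == 128 + SIGKILL:   # 137
        return "oom"
    if code == 126:
        return "permission_denied"
    return "unknown"


print(classify_exit(137))   # oom
print(classify_exit(-9))    # oom
print(classify_exit(124))   # timeout
```

Reasons such as provider_error or sibling_aborted cannot be inferred from an exit code alone, so a classifier like this would only cover the process-level rows.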
TUI Visual States (7 classifications)
The Bernstein terminal dashboard classifies agents into visual states derived from session metadata rather than read directly from the FSM. These presentation-layer states help operators understand agent health at a glance.
Source: src/bernstein/tui/agent_states.py (AgentState, classify_agent_state).
| Visual State | Indicator | Color | Meaning |
|---|---|---|---|
| SPAWNING | ◔ | yellow | Agent process is launching. Timeout: 60 s before reclassified as DEAD. |
| RUNNING | ● | green | Agent is actively working with a recent heartbeat (in_progress/running status). |
| STALLED | ◐ | dark orange | Agent has a PID and active status but no heartbeat for > 5 minutes. |
| MERGING | ⇄ | blue | Agent is committing, pushing, or merging results. |
| DEAD | ○ | red | Session ended (done, failed, cancelled, killed), or spawn timed out, or no PID on a non-active status. |
| IDLE | □ | gray | Agent is waiting for a new task (idle, waiting, or paused status). |
| UNKNOWN | ◌ | dim | Unrecognized status string or unexpected metadata combination. |
Mapping to Core FSM States
TUI visual states are derived from the core FSM state (starting, working, idle, dead) plus process-level metadata (PID presence, heartbeat timestamp, elapsed time). They do not correspond 1:1 to FSM states.
| Core FSM State | TUI Visual State | Condition |
|---|---|---|
| starting | SPAWNING | Spawn age < 60 s |
| starting | DEAD | Spawn age ≥ 60 s (timeout) |
| working | RUNNING | Last heartbeat < 5 min ago |
| working | STALLED | Last heartbeat ≥ 5 min ago |
| working | MERGING | Status string is merging / committing / pushing |
| idle | IDLE | Agent awaiting next task |
| dead | DEAD | Session ended |
| (any) | UNKNOWN | Unclassified metadata combination |
Thresholds (configurable via AgentStateThresholds): stall threshold = 300 s (5 min), spawn timeout = 60 s.
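The mapping table can be read as a small decision function. The sketch below is an illustration under stated assumptions rather than the code in agent_states.py: the function name `classify_visual_state`, its parameters, and the field names on `AgentStateThresholds` are hypothetical, and checking the merging status before the heartbeat age is a guess about precedence that the table does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentStateThresholds:
    stall_s: float = 300.0         # 5 min without a heartbeat
    spawn_timeout_s: float = 60.0  # max time allowed in "starting"


MERGE_STATUSES = {"merging", "committing", "pushing"}


def classify_visual_state(fsm_state, *, spawn_age_s=0.0, heartbeat_age_s=0.0,
                          status="", thresholds=AgentStateThresholds()):
    """Map a core FSM state plus metadata to a TUI visual state name."""
    if fsm_state == "starting":
        if spawn_age_s < thresholds.spawn_timeout_s:
            return "SPAWNING"
        return "DEAD"
    if fsm_state == "working":
        if status in MERGE_STATUSES:  # assumed to take precedence
            return "MERGING"
        if heartbeat_age_s < thresholds.stall_s:
            return "RUNNING"
        return "STALLED"
    if fsm_state == "idle":
        return "IDLE"
    if fsm_state == "dead":
        return "DEAD"
    return "UNKNOWN"


print(classify_visual_state("working", heartbeat_age_s=420))  # STALLED
print(classify_visual_state("starting", spawn_age_s=75))      # DEAD
```

Passing the thresholds as a value object keeps the 300 s and 60 s limits in one place, so an operator can tighten or relax them without touching the classification logic.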
Agent Turn States (10 states)
The agent turn FSM operates at a finer granularity than the agent session FSM above. It tracks the lifecycle of a single task-handling turn within an agent process, from the moment a task is claimed through to cleanup.
Source of truth: src/bernstein/core/agents/agent_turn_state.py (AgentTurnState, AgentTurnEvent, AgentTurnStateMachine).
| State | Description |
|---|---|
| IDLE | No active turn; the agent is between tasks or not yet assigned. |
| CLAIMING | A task has been claimed; worktree is being prepared. |
| SPAWNING | Agent process has been launched but hasn't started executing yet. |
| RUNNING | Agent process is actively working on the task. |
| TOOL_USE | Agent is executing an external tool (file editor, shell, search, etc.). |
| COMPACTING | Context window is near its limit; compaction/summarization is in progress. |
| VERIFYING | Task work is done; janitor or LLM verification is pending. |
| COMPLETING | Verification passed; task is being marked done and metrics emitted. |
| FAILED | An error, crash, or verification failure occurred. |
| REAPED | Cleanup is complete (worktree removed, metrics flushed). Terminal. |
Agent Turn State Diagram
Events that drive transitions are shown on each arrow.
stateDiagram-v2
[*] --> IDLE
IDLE --> CLAIMING : task_claimed
CLAIMING --> SPAWNING : agent_spawned
CLAIMING --> FAILED : task_failed
SPAWNING --> RUNNING : agent_spawned
SPAWNING --> FAILED : task_failed
RUNNING --> TOOL_USE : tool_started
RUNNING --> COMPACTING : compact_needed
RUNNING --> VERIFYING : verify_requested
RUNNING --> FAILED : task_failed
TOOL_USE --> RUNNING : tool_completed
TOOL_USE --> FAILED : task_failed
COMPACTING --> RUNNING : verify_requested
COMPACTING --> FAILED : task_failed
VERIFYING --> COMPLETING : task_completed
VERIFYING --> RUNNING : compact_needed
VERIFYING --> FAILED : task_failed
COMPLETING --> REAPED : agent_reaped
FAILED --> REAPED : agent_reaped
REAPED --> [*]
Agent Turn Transition Table (exhaustive)
| From | Event | To | Notes |
|---|---|---|---|
| IDLE | task_claimed | CLAIMING | Orchestrator picks up the next open task |
| CLAIMING | agent_spawned | SPAWNING | Worktree ready; CLI process launched |
| CLAIMING | task_failed | FAILED | Worktree setup failed (permission error, git conflict) |
| SPAWNING | agent_spawned | RUNNING | Process confirmed alive and active |
| SPAWNING | task_failed | FAILED | Spawn error or adapter rejected the task |
| RUNNING | tool_started | TOOL_USE | Agent invoked a tool (Edit, Bash, Glob, etc.) |
| RUNNING | compact_needed | COMPACTING | Context window approaching the model's limit |
| RUNNING | verify_requested | VERIFYING | Agent signals it is done; verification begins |
| RUNNING | task_failed | FAILED | Runtime error or abort during execution |
| TOOL_USE | tool_completed | RUNNING | Tool call finished; agent resumes |
| TOOL_USE | task_failed | FAILED | Fatal error inside the tool invocation |
| COMPACTING | verify_requested | RUNNING | Compaction done; context summarized; agent continues |
| COMPACTING | task_failed | FAILED | Compaction itself failed (compact_failure abort) |
| VERIFYING | task_completed | COMPLETING | All completion signals satisfied |
| VERIFYING | compact_needed | RUNNING | Context grew during verification; must compact first |
| VERIFYING | task_failed | FAILED | Janitor rejected the result |
| COMPLETING | agent_reaped | REAPED | Task marked done; worktree removed; metrics flushed |
| FAILED | agent_reaped | REAPED | Error handled; resources cleaned up |
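Because this FSM is event-driven, the table can be keyed on (state, event) pairs. The following sketch transcribes the 18 rows into a dictionary and replays a sequence of events; the names `TURN_TRANSITIONS` and `replay` are hypothetical and are not the AgentTurnStateMachine API.

```python
# (current state, event) -> next state, transcribed from the table.
TURN_TRANSITIONS = {
    ("IDLE", "task_claimed"): "CLAIMING",
    ("CLAIMING", "agent_spawned"): "SPAWNING",
    ("CLAIMING", "task_failed"): "FAILED",
    ("SPAWNING", "agent_spawned"): "RUNNING",
    ("SPAWNING", "task_failed"): "FAILED",
    ("RUNNING", "tool_started"): "TOOL_USE",
    ("RUNNING", "compact_needed"): "COMPACTING",
    ("RUNNING", "verify_requested"): "VERIFYING",
    ("RUNNING", "task_failed"): "FAILED",
    ("TOOL_USE", "tool_completed"): "RUNNING",
    ("TOOL_USE", "task_failed"): "FAILED",
    ("COMPACTING", "verify_requested"): "RUNNING",
    ("COMPACTING", "task_failed"): "FAILED",
    ("VERIFYING", "task_completed"): "COMPLETING",
    ("VERIFYING", "compact_needed"): "RUNNING",
    ("VERIFYING", "task_failed"): "FAILED",
    ("COMPLETING", "agent_reaped"): "REAPED",
    ("FAILED", "agent_reaped"): "REAPED",
}


def replay(events, start="IDLE"):
    """Apply events in order; reject any event the table does not allow."""
    state = start
    for event in events:
        try:
            state = TURN_TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"event {event!r} not allowed in {state}") from None
    return state


happy_path = [
    "task_claimed", "agent_spawned", "agent_spawned",
    "tool_started", "tool_completed",
    "verify_requested", "task_completed", "agent_reaped",
]
print(replay(happy_path))  # REAPED
```

The happy path shows agent_spawned being applied twice in a row: the first occurrence moves CLAIMING to SPAWNING and the second moves SPAWNING to RUNNING, which is why the same event name can appear on two different arrows.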
Events Reference
| Event | Fired by | Meaning |
|---|---|---|
| task_claimed | Spawner / orchestrator | A task was successfully reserved for this agent |
| agent_spawned | Adapter / spawner | CLI process started (fires twice: at launch and at readiness confirmation) |
| tool_started | Agent turn monitor | Agent began a tool call |
| tool_completed | Agent turn monitor | Tool call returned |
| compact_needed | Token monitor | Context window usage crossed the compaction threshold |
| verify_requested | Agent / orchestrator | Agent declared the task finished |
| task_completed | Janitor | All completion signals confirmed |
| task_failed | Any layer | Unrecoverable error at the current phase |
| agent_reaped | Janitor / spawner | Cleanup of process and worktree is complete |
Abort Chain Hierarchy
Agent aborts follow a three-level containment hierarchy:
| Scope | Effect | Cascade |
|---|---|---|
| TOOL | Single tool invocation aborted; agent session continues | No cascade |
| SIBLING | Sibling agents (same parent) receive SHUTDOWN signal | Does not affect parent unless AbortPolicy.sibling_to_session is set |
| SESSION | Full agent session torn down; SHUTDOWN cascades to all descendants | Propagates to all children via propagate_abort() |
Escalation between levels is opt-in via AbortPolicy. By default, each level contains its failure without propagating upward.
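The containment rules above can be pictured with a small tree of sessions. This is a sketch under several assumptions, not the project's abort machinery: the `Session` class, the `abort` function, and the signature of `propagate_abort` are invented for the example, and whether a SIBLING abort also reaches the siblings' own descendants is left out because the table does not say.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class AbortScope(Enum):
    TOOL = auto()
    SIBLING = auto()
    SESSION = auto()


@dataclass
class AbortPolicy:
    sibling_to_session: bool = False  # opt-in escalation, off by default


@dataclass(eq=False)
class Session:
    name: str
    parent: Optional["Session"] = None
    children: List["Session"] = field(default_factory=list)
    shutdown: bool = False

    def spawn_child(self, name: str) -> "Session":
        child = Session(name, parent=self)
        self.children.append(child)
        return child


def propagate_abort(session: Session) -> None:
    """Send SHUTDOWN to a session and every descendant."""
    session.shutdown = True
    for child in session.children:
        propagate_abort(child)


def abort(origin: Session, scope: AbortScope, policy: AbortPolicy) -> None:
    if scope is AbortScope.TOOL:
        return  # contained: the session keeps running
    if scope is AbortScope.SESSION:
        propagate_abort(origin)
        return
    # SIBLING scope: signal agents that share the origin's parent.
    parent = origin.parent
    if parent is None:
        return
    if policy.sibling_to_session:
        propagate_abort(parent)  # escalate to the whole parent session
        return
    for sibling in parent.children:
        if sibling is not origin:
            sibling.shutdown = True
```

With the default policy, a SIBLING abort raised by one child marks the other children for shutdown and leaves the parent untouched; setting `sibling_to_session=True` turns the same call into a teardown of the parent and everything beneath it.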