Skip to content

Lifecycle State Machines

Bernstein uses deterministic finite state machines (FSMs) for both task and agent lifecycle management. All transitions flow through the Lifecycle Governance Kernel (core/tasks/lifecycle.py), which validates transitions against an explicit transition table, rejects illegal moves with IllegalTransitionError, and emits typed LifecycleEvent records for audit, replay, and metrics.

Source of truth: src/bernstein/core/tasks/lifecycle.py (transition tables), src/bernstein/core/tasks/models.py (TaskStatus enum, AgentSession dataclass).


Task States (12 states)

Status Description
PLANNED Awaiting human approval before execution (plan mode). Tasks created from plan YAML files start here.
OPEN Ready for an agent to claim. The default starting state for dynamically created tasks.
CLAIMED An agent has claimed this task but has not yet started work.
IN_PROGRESS Agent is actively working on the task.
DONE Agent reported completion. Pending janitor verification and merge.
CLOSED Verified and merged. Terminal state.
FAILED Agent reported failure or verification rejected the result. Can be retried.
BLOCKED Waiting on an external dependency (another task, resource, or approval).
WAITING_FOR_SUBTASKS Parent task waiting for child subtasks to complete (agent decomposed work).
CANCELLED Manually or programmatically cancelled. Terminal state.
ORPHANED Agent crashed mid-task; pending crash recovery by the orchestrator.
PENDING_APPROVAL Task completed but requires human approval before taking effect.

Task State Diagram

stateDiagram-v2
    [*] --> PLANNED : plan mode
    [*] --> OPEN : dynamic creation

    PLANNED --> OPEN : approved
    PLANNED --> CANCELLED : rejected

    OPEN --> CLAIMED : agent claims task
    OPEN --> WAITING_FOR_SUBTASKS : decomposed before claim
    OPEN --> CANCELLED : manual cancel

    CLAIMED --> IN_PROGRESS : agent starts work
    CLAIMED --> OPEN : unclaim / force-reassign
    CLAIMED --> DONE : fast completion (trivial task)
    CLAIMED --> FAILED : immediate failure
    CLAIMED --> CANCELLED : manual cancel
    CLAIMED --> WAITING_FOR_SUBTASKS : agent splits work
    CLAIMED --> BLOCKED : dependency discovered

    IN_PROGRESS --> DONE : agent reports success
    IN_PROGRESS --> FAILED : agent reports failure
    IN_PROGRESS --> BLOCKED : dependency discovered
    IN_PROGRESS --> WAITING_FOR_SUBTASKS : agent decomposes task
    IN_PROGRESS --> OPEN : requeue (force-reassign)
    IN_PROGRESS --> CANCELLED : manual cancel
    IN_PROGRESS --> ORPHANED : agent crash detected

    ORPHANED --> DONE : partial work merged successfully
    ORPHANED --> FAILED : unrecoverable
    ORPHANED --> OPEN : requeued for retry

    BLOCKED --> OPEN : dependency resolved
    BLOCKED --> CANCELLED : manual cancel

    WAITING_FOR_SUBTASKS --> DONE : all subtasks completed
    WAITING_FOR_SUBTASKS --> BLOCKED : subtask timeout escalation
    WAITING_FOR_SUBTASKS --> CANCELLED : manual cancel

    FAILED --> OPEN : retry (within max_retries)

    DONE --> CLOSED : janitor verified + merged
    DONE --> FAILED : verification rejected

    CLOSED --> [*]
    CANCELLED --> [*]

    %% PENDING_APPROVAL is a terminal state set directly by the approval
    %% subsystem. It has no FSM-managed inbound or outbound transitions.
    PENDING_APPROVAL --> [*]

Note — PENDING_APPROVAL: This state exists in the TaskStatus enum and is used by the approval subsystem (see src/bernstein/core/security/approval.py). It is set directly rather than through the TASK_TRANSITIONS table, so it has no FSM-managed entry or exit path. Tasks in this state await human review and cannot progress further without manual intervention.

Task Transition Table (exhaustive)

Every allowed transition is listed below. The guard function for all transitions is _always (unconditional). Any transition not in this table raises IllegalTransitionError.

From To Trigger
PLANNED OPEN Human approves the planned task
PLANNED CANCELLED Human rejects the planned task
OPEN CLAIMED Agent calls claim_next() or claim_by_id()
OPEN WAITING_FOR_SUBTASKS Task decomposed before agent assignment
OPEN CANCELLED Manual cancellation
CLAIMED IN_PROGRESS Agent begins execution
CLAIMED OPEN Unclaim / force-reassign to different agent
CLAIMED DONE Fast completion (task was trivial)
CLAIMED FAILED Immediate failure (e.g., scope violation)
CLAIMED CANCELLED Manual cancellation
CLAIMED WAITING_FOR_SUBTASKS Agent splits task into subtasks
CLAIMED BLOCKED Dependency discovered after claim
IN_PROGRESS DONE Agent reports successful completion
IN_PROGRESS FAILED Agent reports failure
IN_PROGRESS BLOCKED External dependency blocks progress
IN_PROGRESS WAITING_FOR_SUBTASKS Agent decomposes task mid-execution
IN_PROGRESS OPEN Force-requeue for different agent
IN_PROGRESS CANCELLED Manual cancellation
IN_PROGRESS ORPHANED Heartbeat timeout / agent crash detected
ORPHANED DONE Partial work saved and merged
ORPHANED FAILED Crash recovery failed
ORPHANED OPEN Requeued for retry by another agent
BLOCKED OPEN Blocking dependency resolved
BLOCKED CANCELLED Manual cancellation
WAITING_FOR_SUBTASKS DONE All child subtasks completed
WAITING_FOR_SUBTASKS BLOCKED Subtask timeout escalation (parent blocked waiting on unresponsive subtask)
WAITING_FOR_SUBTASKS CANCELLED Manual cancellation
FAILED OPEN Retry (respects max_retries, default 3)
DONE CLOSED Janitor verification passed + branch merged
DONE FAILED Janitor verification rejected the result

Terminal States

Terminal states have no outbound transitions. Computed by the lifecycle kernel: - CLOSED - CANCELLED - PENDING_APPROVAL (awaits external action; no programmatic exit)

Adaptive Timeout

Task timeouts are not static. The adaptive timeout system (src/bernstein/core/orchestration/adaptive_timeout.py) adjusts wall-clock timeouts based on historical task durations. Default scope-based timeouts are defined in src/bernstein/core/defaults.py (TASK.scope_timeout_s): small=15 min, medium=30 min, large=60 min, XL=120 min.

Graduated Access Control

The graduated access control system (src/bernstein/core/security/graduated_access.py) gates which lifecycle transitions an agent is permitted to perform based on its trust level and track record. New agents start with restricted permissions that expand as they demonstrate reliability.


Agent States (4 states)

Status Description
starting Agent process has been spawned but has not yet confirmed readiness.
working Agent is actively executing a task.
idle Agent finished its current task and is available for new work.
dead Agent process has exited (success, crash, kill, timeout, or recycled). Terminal state.

Agent State Diagram

stateDiagram-v2
    [*] --> starting : spawn()

    starting --> working : process confirmed alive
    starting --> dead : spawn failure / fast exit

    working --> idle : task completed, agent awaiting reuse
    working --> dead : crash / kill / timeout / circuit break

    idle --> working : new task assigned
    idle --> dead : idle recycled (resource reclaim)

    dead --> [*]

Agent Transition Table (exhaustive)

From To Trigger
starting working Process started successfully, heartbeat received
starting dead SpawnError, RateLimitError, or fast exit detection
working idle Agent finished current task, session still alive
working dead Process crash (SIGKILL/OOM), manual kill, timeout watchdog, or circuit breaker
idle working Orchestrator assigns a new task to the existing session
idle dead Idle recycling (orchestrator reclaims resources from idle agents)

Transition Metadata

Every transition produces a LifecycleEvent with: - timestamp (Unix epoch) - entity_type ("task" or "agent") - entity_id (task ID or session ID) - from_status / to_status - actor (who triggered it: "task_store", "spawner", "janitor", "plan_approval", etc.) - reason (human-readable explanation) - transition_reason (canonical TransitionReason enum, when applicable) - abort_reason (canonical AbortReason enum, for abnormal agent termination)

TransitionReason Values

These canonical reasons classify why a lifecycle transition occurred:

Value Meaning
completed Normal successful completion
aborted Explicit abort requested
retry Task being retried after failure
prompt_too_long Input exceeded model context window
max_output_tokens Model hit output token limit
max_turns Agent reached max conversation turns
provider_413 Provider returned 413 (payload too large)
provider_529 Provider returned 529 (overloaded)
compaction_failed Context compaction/summarization failed
stop_hook_blocked A stop hook prevented the transition
permission_denied Insufficient permissions for the operation
sibling_aborted A sibling agent in the same group was aborted
orphan_recovered Orphaned task was automatically recovered

AbortReason Values

These classify abnormal agent terminations:

Value Meaning
user_interrupt SIGINT (Ctrl+C)
shutdown_signal SIGTERM (graceful shutdown)
timeout Watchdog timer expired (exit code 124)
oom Out of memory (exit code 137 / SIGKILL)
permission_denied Exit code 126
provider_error API provider returned an unrecoverable error
bash_error A bash tool invocation caused a fatal error
sibling_aborted Cascading abort from sibling agent failure
parent_aborted Cascading abort from parent session
compact_failure Context window compaction failed
unknown Unclassified termination

TUI Visual States (7 classifications)

The Bernstein terminal dashboard classifies agents into visual states derived from session metadata — not from the FSM directly. These presentation-layer states help operators understand agent health at a glance.

Source: src/bernstein/tui/agent_states.py (AgentState, classify_agent_state).

Visual State Indicator Color Meaning
SPAWNING yellow Agent process is launching. Timeout: 60 s before reclassified as DEAD.
RUNNING green Agent is actively working with a recent heartbeat (in_progress/running status).
STALLED dark orange Agent has a PID and active status but no heartbeat for > 5 minutes.
MERGING blue Agent is committing, pushing, or merging results.
DEAD red Session ended (done, failed, cancelled, killed), or spawn timed out, or no PID on a non-active status.
IDLE gray Agent is waiting for a new task (idle, waiting, or paused status).
UNKNOWN dim Unrecognized status string or unexpected metadata combination.

Mapping to Core FSM States

TUI visual states are derived from the core FSM state (starting, working, idle, dead) plus process-level metadata (PID presence, heartbeat timestamp, elapsed time). They do not correspond 1:1 to FSM states.

Core FSM State TUI Visual State Condition
starting SPAWNING Spawn age < 60 s
starting DEAD Spawn age ≥ 60 s (timeout)
working RUNNING Last heartbeat < 5 min ago
working STALLED Last heartbeat ≥ 5 min ago
working MERGING Status string is merging / committing / pushing
idle IDLE Agent awaiting next task
dead DEAD Session ended
(any) UNKNOWN Unclassified metadata combination

Thresholds (configurable via AgentStateThresholds): stall threshold = 300 s (5 min), spawn timeout = 60 s.


Agent Turn States (10 states)

The agent turn FSM operates at a finer granularity than the agent session FSM above. It tracks the lifecycle of a single task handling turn within an agent process — from the moment a task is claimed through to cleanup.

Source of truth: src/bernstein/core/agents/agent_turn_state.py (AgentTurnState, AgentTurnEvent, AgentTurnStateMachine).

State Description
IDLE No active turn — agent is between tasks or not yet assigned.
CLAIMING A task has been claimed; worktree is being prepared.
SPAWNING Agent process has been launched but hasn't started executing yet.
RUNNING Agent process is actively working on the task.
TOOL_USE Agent is executing an external tool (file editor, shell, search, etc.).
COMPACTING Context window is near its limit; compaction/summarization is in progress.
VERIFYING Task work is done; janitor or LLM verification is pending.
COMPLETING Verification passed; task is being marked done and metrics emitted.
FAILED An error, crash, or verification failure occurred.
REAPED Cleanup is complete (worktree removed, metrics flushed). Terminal.

Agent Turn State Diagram

Events that drive transitions are shown on each arrow.

stateDiagram-v2
    [*] --> IDLE

    IDLE --> CLAIMING : task_claimed

    CLAIMING --> SPAWNING : agent_spawned
    CLAIMING --> FAILED : task_failed

    SPAWNING --> RUNNING : agent_spawned
    SPAWNING --> FAILED : task_failed

    RUNNING --> TOOL_USE : tool_started
    RUNNING --> COMPACTING : compact_needed
    RUNNING --> VERIFYING : verify_requested
    RUNNING --> FAILED : task_failed

    TOOL_USE --> RUNNING : tool_completed
    TOOL_USE --> FAILED : task_failed

    COMPACTING --> RUNNING : verify_requested
    COMPACTING --> FAILED : task_failed

    VERIFYING --> COMPLETING : task_completed
    VERIFYING --> RUNNING : compact_needed
    VERIFYING --> FAILED : task_failed

    COMPLETING --> REAPED : agent_reaped

    FAILED --> REAPED : agent_reaped

    REAPED --> [*]

Agent Turn Transition Table (exhaustive)

From Event To Notes
IDLE task_claimed CLAIMING Orchestrator picks up the next open task
CLAIMING agent_spawned SPAWNING Worktree ready; CLI process launched
CLAIMING task_failed FAILED Worktree setup failed (permission error, git conflict)
SPAWNING agent_spawned RUNNING Process confirmed alive and active
SPAWNING task_failed FAILED Spawn error or adapter rejected the task
RUNNING tool_started TOOL_USE Agent invoked a tool (Edit, Bash, Glob, etc.)
RUNNING compact_needed COMPACTING Context window approaching the model's limit
RUNNING verify_requested VERIFYING Agent signals it is done; verification begins
RUNNING task_failed FAILED Runtime error or abort during execution
TOOL_USE tool_completed RUNNING Tool call finished; agent resumes
TOOL_USE task_failed FAILED Fatal error inside the tool invocation
COMPACTING verify_requested RUNNING Compaction done; context summarized; agent continues
COMPACTING task_failed FAILED Compaction itself failed (compact_failure abort)
VERIFYING task_completed COMPLETING All completion signals satisfied
VERIFYING compact_needed RUNNING Context grew during verification; must compact first
VERIFYING task_failed FAILED Janitor rejected the result
COMPLETING agent_reaped REAPED Task marked done; worktree removed; metrics flushed
FAILED agent_reaped REAPED Error handled; resources cleaned up

Events Reference

Event Fired by Meaning
task_claimed Spawner / orchestrator A task was successfully reserved for this agent
agent_spawned Adapter / spawner CLI process started (fires twice: at launch and at readiness confirmation)
tool_started Agent turn monitor Agent began a tool call
tool_completed Agent turn monitor Tool call returned
compact_needed Token monitor Context window usage crossed the compaction threshold
verify_requested Agent / orchestrator Agent declared the task finished
task_completed Janitor All completion signals confirmed
task_failed Any layer Unrecoverable error at the current phase
agent_reaped Janitor / spawner Cleanup of process and worktree is complete

Abort Chain Hierarchy

Agent aborts follow a three-level containment hierarchy:

TOOL  <  SIBLING  <  SESSION
Scope Effect Cascade
TOOL Single tool invocation aborted; agent session continues No cascade
SIBLING Sibling agents (same parent) receive SHUTDOWN signal Does not affect parent unless AbortPolicy.sibling_to_session is set
SESSION Full agent session torn down; SHUTDOWN cascades to all descendants Propagates to all children via propagate_abort()

Escalation between levels is opt-in via AbortPolicy. By default, each level contains its failure without propagating upward.