Verification tracking
Bernstein flags task completions that finish without any sign of verification — no tests, no quality gates, no completion-signal check — and raises an alert when the rate of those unverified completions crosses a threshold. This page is the operator's guide to that signal: what counts as "verified", when the alert fires, how to configure it, and what to do when one shows up.
If you just want to know:
- Where the data lives: .sdd/metrics/verification_nudges.jsonl
- Where the alert surfaces: bernstein status (CLI), /status/ (HTTP), the TUI dashboard.
- Default trigger: more than 30% unverified, with at least 3 completions in the window.
What "verified" means
A completion is verified if any of these are true:
| Evidence type | Source | Field |
|---|---|---|
| Tests run | Agent log summary | tests_run |
| Quality gates run | Quality gate result object | quality_gates_run |
| Completion signals checked | Janitor verify_task() | completion_signals_checked |
The logic is a simple OR: verified = tests_run OR quality_gates_run OR completion_signals_checked.
A task whose log summary shows none of these is unverified. The tracker writes one record per completion to the JSONL ledger and stamps task.verification_count (0–3) and task.flagged_unverified (bool) on the task object so any downstream consumer can filter on them.
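The OR rule and the two stamped fields can be sketched as follows. This is an illustration only, not Bernstein's implementation: the Evidence dataclass and the classify() helper are hypothetical names, and treating flagged_unverified as the negation of verified is an assumption. Only the three evidence fields, verification_count, and flagged_unverified come from this page.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Evidence:
    """Hypothetical container for the three evidence flags in the table above."""
    tests_run: bool = False
    quality_gates_run: bool = False
    completion_signals_checked: bool = False


def classify(evidence: Evidence) -> Tuple[bool, int, bool]:
    """Return (verified, verification_count, flagged_unverified)."""
    flags = (
        evidence.tests_run,
        evidence.quality_gates_run,
        evidence.completion_signals_checked,
    )
    verification_count = sum(flags)  # how many evidence types are present, 0 to 3
    verified = any(flags)            # simple OR over the three flags
    # Assumption: a completion is flagged exactly when it has no evidence at all.
    return verified, verification_count, not verified
```

A completion with only tests_run set would come back as verified with a count of 1; one with no flags set would come back unverified, with a count of 0, and flagged.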
When the alert fires
The tracker keeps a running summary with two parameters:
| Parameter | Default | What it does |
|---|---|---|
| nudge_threshold | 0.3 | Unverified ratio above which alerts fire |
| MIN_COMPLETIONS_FOR_NUDGE | 3 | Minimum completions before threshold checks |
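The math: unverified_ratio = unverified_count / total_completions, and threshold_exceeded is true only when total_completions >= MIN_COMPLETIONS_FOR_NUDGE and unverified_ratio > nudge_threshold.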
The comparison is strict (>, not >=) — exactly 30% does not trigger. The MIN_COMPLETIONS_FOR_NUDGE floor exists so the very first unverified completion in a fresh session does not flip the alert (1/1 = 100%).
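The check described above can be sketched in a few lines. The function name and signature are hypothetical and only mirror the two parameters and the strict comparison documented here; the real tracker may structure this differently.

```python
MIN_COMPLETIONS_FOR_NUDGE = 3  # default from the parameter table


def threshold_exceeded(
    unverified: int,
    total: int,
    nudge_threshold: float = 0.3,
    min_completions: int = MIN_COMPLETIONS_FOR_NUDGE,
) -> bool:
    """Return True when the unverified ratio should raise the alert."""
    # Below the floor (or with no completions at all) the ratio is never checked.
    if total <= 0 or total < min_completions:
        return False
    # Strict comparison: a ratio exactly equal to the threshold does not fire.
    return unverified / total > nudge_threshold
```

With the defaults, 1 unverified out of 1 completion does not fire (below the floor), 3 out of 10 does not fire (exactly 30%), and 4 out of 10 does.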
Where it surfaces
| Surface | Condition | What you see |
|---|---|---|
| GET /status/ API | always | verification_nudge object in JSON |
| bernstein status CLI | threshold_exceeded | red ALERT with counts and ratio |
| bernstein status CLI | unverified > 0 | yellow Notice with counts |
| TUI dashboard | first time threshold trips | toast notification, severity=warning, 10 s timeout |
The API response shape is small enough to paste into a runbook:
{
  "total_completions": 10,
  "verified_count": 6,
  "unverified_count": 4,
  "unverified_ratio": 0.4,
  "threshold_exceeded": true,
  "nudge_threshold": 0.3,
  "recent_unverified": ["task-a", "task-b", "task-c"]
}
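For a runbook or chat notification, the object above can be condensed into one line. The sketch below assumes you have already extracted the verification_nudge object from the /status/ response and hold it as a JSON string; summarize_nudge() is a hypothetical helper, not part of Bernstein.

```python
import json


def summarize_nudge(payload: str) -> str:
    """Condense a verification_nudge JSON object into a one-line summary."""
    nudge = json.loads(payload)
    state = "ALERT" if nudge["threshold_exceeded"] else "ok"
    # An empty list is rendered as "none" so the line never ends abruptly.
    recent = ", ".join(nudge["recent_unverified"]) or "none"
    return (
        f"{state}: {nudge['unverified_count']}/{nudge['total_completions']} "
        f"unverified ({nudge['unverified_ratio']:.0%}, "
        f"threshold {nudge['nudge_threshold']:.0%}); recent: {recent}"
    )
```

The helper reads only fields that appear in the response shape above, so it should fail loudly with a KeyError if the shape ever changes.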
Configuration
YAML (bernstein.yaml)
verification_nudge:
  threshold: 0.3       # 0.0 = alert on any unverified, 1.0 = never alert
  min_completions: 3   # how many completions before threshold matters
Tightening the bar
A few tuning patterns we have seen work:
| Goal | Threshold | Min completions |
|---|---|---|
| Maximum sensitivity (CI, release branches) | 0.10 | 3 |
| Default | 0.30 | 3 |
| Sandboxes / spike work where verification is rare | 0.50 | 5 |
| "Tell me only when something is really wrong" | 0.70 | 10 |
If you raise threshold above 0.5 you are silencing the signal more than tuning it; consider whether you actually want this gate at all.
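To see what a given row of the table means in practice, it can help to compute how many unverified completions it takes to trip the alert. The helper below is a hypothetical sketch that reuses the strict comparison and the minimum-completions floor described earlier on this page; it is not a Bernstein API.

```python
from typing import Optional


def min_unverified_to_alert(
    total: int, threshold: float, min_completions: int
) -> Optional[int]:
    """Smallest unverified count that trips the alert, or None if it cannot trip."""
    # Below the floor (or with zero completions) the alert never fires.
    if total < max(min_completions, 1):
        return None
    # Apply the same strict ">" test the alert uses, from the smallest count up.
    for unverified in range(total + 1):
        if unverified / total > threshold:
            return unverified
    return None  # e.g. threshold 1.0: the ratio can never exceed it
```

Under these assumptions, the maximum-sensitivity row (0.10, 3) fires on a single unverified completion out of 3, the default fires at 4 out of 10, and the 0.70 row needs 8 out of 10.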
Resetting state
The ledger is append-only. To reset between runs:
- delete .sdd/metrics/verification_nudges.jsonl, or
- call tracker.reset() from a hook or shell script.
The in-memory tracker also resets at process exit.
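The file-deletion route can be scripted with the standard library alone. This sketch assumes the ledger path is relative to the directory you run it from; reset_ledger() is a hypothetical helper. It touches only the file on disk and does not call tracker.reset().

```python
from pathlib import Path

# Ledger location as documented on this page; assumed relative to the project root.
LEDGER = Path(".sdd/metrics/verification_nudges.jsonl")


def reset_ledger(ledger: Path = LEDGER) -> bool:
    """Delete the ledger file and report whether a file was actually removed."""
    existed = ledger.exists()
    # missing_ok (Python 3.8+) makes the reset safe to repeat.
    ledger.unlink(missing_ok=True)
    return existed
```

Returning a boolean lets a hook log whether there was anything to clear without treating an already-clean state as an error.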
Operator playbook
You see a red ALERT in bernstein status. What now?
- Don't panic, and don't disable the alert. The signal is a ratio, not an error. It only means that more than threshold of recent completions had no verification evidence at all. The agent likely did real work; it just did not run tests or trip a quality gate.
- Pull the recent unverified IDs, either from bernstein status or, for example, from the recent_unverified field of the GET /status/ response.
- Spot-check one. Pick a flagged task ID. Open its log summary and confirm: did it really skip tests, or is the agent's log summary missing the evidence Bernstein looks for? The latter is a parsing miss (assumption A1 in the spec); fix the adapter, not the threshold.
- If the agent is genuinely skipping verification, look at:
  - The plan: did the YAML omit a verify step?
  - The quality gates: are they wired up but failing fast?
  - The model: is it deciding "this is trivial, no test needed" when it actually needs one? Consider tightening prompts or adding a hook that forces tests_run.
- If you intentionally allow unverified completions (e.g. doc fixes, single-line constants), raise min_completions rather than the threshold. That suppresses the alert during quiet sessions without lying about busy ones.
- Resolve the alert by completing more verified tasks (the ratio drifts back below threshold) or by resetting the ledger if you need a clean baseline for a release.
The alert is not auto-clearing once you fix the underlying issue — it tracks completions in a window. If the window keeps including old unverified completions, the ratio stays high. Reset the ledger or wait for the unverified ones to age out.
Code pointers
| File | What it does |
|---|---|
| src/bernstein/core/quality/verification_nudge.py | VerificationNudgeTracker, VerificationRecord, NudgeSummary, load_nudge_summary() |
| src/bernstein/core/models.py | Task.verification_count, Task.flagged_unverified fields |
| src/bernstein/core/quality/janitor.py | verify_task(), which supplies the completion_signals_checked evidence |
| tests/unit/test_verification_nudge.py | 44 tests across 8 classes (record, persistence, summary, alert thresholds) |
JSONL ledger schema (one object per line):
{
  "task_id": "string",
  "session_id": "string",
  "timestamp": 1712200000.0,
  "tests_run": false,
  "quality_gates_run": false,
  "completion_signals_checked": false,
  "verified": false
}
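Because the ledger is plain JSONL, the flagged task IDs can be pulled out without going through Bernstein at all. The sketch below relies only on the task_id and verified fields from the schema above; unverified_task_ids() is a hypothetical helper, and it does not use the project's own load_nudge_summary().

```python
import json
from typing import Iterable, List


def unverified_task_ids(lines: Iterable[str]) -> List[str]:
    """Return task IDs of ledger records whose verified field is false."""
    ids: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        if not record["verified"]:
            ids.append(record["task_id"])
    return ids
```

Any iterable of lines works, so an open file handle on .sdd/metrics/verification_nudges.jsonl can be passed directly. IDs are returned in ledger order, oldest first.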
The full engineering spec lives at dev/specs/internal-workflows/WORKFLOW-verification-nudge.md.
Related
- Permission modes — how the approval gate decides whether a completion needs human signoff.
- Runbooks — automated remediation for failing tasks.