CAS Store
Where do duplicated artifact bytes go to die?
Bernstein stores agent outputs — file dumps, log fragments, structured state — in a content-addressable store keyed by the SHA-256 digest of their bytes. Identical content produces identical keys, so writing the same payload twice costs zero extra bytes on disk. The layout is a flat-file knock-off of git's object store, living under .sdd/cas/.
If you only have time for one sentence: the CAS store is a directory of SHA-256-keyed blobs at .sdd/cas/<xx>/<sha256>, with a .meta.json sidecar per blob, and put() is a no-op on duplicates. The rest of this page covers usage, on-disk layout, GC, and Merkle integrity.
What content-addressed storage is
A normal file store is named: you write report.json and look it up by name. Two writes of the same bytes use twice the disk.
A content-addressed store is keyed by hash: you write report.json and the store hands you back a 64-char hex digest. Look-up is by digest, not name. Identical bytes produce the same digest, so a second write of the same content returns the existing digest without storing anything new. Properties that follow:
- Automatic dedup. Two agents emitting the same artifact share one blob.
- Tamper evidence. A digest mismatch on read = corruption, full stop. Pair with a Merkle tree to prove the whole store hasn't been tampered with.
- Cheap snapshots. A "manifest" (list of digests) is a tiny pointer into the blob store; you can move snapshots around without copying blob content.
- Immutable by construction. You cannot rewrite a blob without its digest changing. The only mutation is delete.
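The dedup property above can be illustrated with a minimal in-memory sketch. `ToyCAS` is a hypothetical stand-in written for this page, not Bernstein's `CASStore`; it only shows why identical bytes collapse to a single entry while a one-byte change yields a new key.

```python
import hashlib
from typing import Optional


class ToyCAS:
    """Hypothetical in-memory content-addressed store (illustration only)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self.dedup_saves = 0  # writes that found an existing blob

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._blobs:
            self.dedup_saves += 1  # same bytes, nothing new stored
        else:
            self._blobs[digest] = content
        return digest

    def get(self, digest: str) -> Optional[bytes]:
        return self._blobs.get(digest)

    def __len__(self) -> int:
        return len(self._blobs)


store = ToyCAS()
first = store.put(b"hello world")
second = store.put(b"hello world")   # duplicate write, same digest
other = store.put(b"hello world!")   # one byte differs, new digest
```

After these three writes the toy store holds two blobs, and `first` and `second` are the same 64-character digest.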
Where Bernstein uses CAS
The store is a primitive other persistence subsystems can plug into. Today the wired-in consumer surface is small (the module is one of the "undocumented surprises" surfaced by code-surface inventory) — CASStore is constructed against .sdd/cas/ and exposed for:
- Artifact dedup. Agent outputs that two or more runs would produce identically (lockfile dumps, generated configs, large prompt-cache snapshots) get hashed into CAS so disk usage tracks unique bytes, not write count.
- Replay state. WAL replay (wal_replay.py) restores the task graph on restart; large blob payloads referenced from WAL entries live in CAS rather than being copied inline.
- Audit evidence. Tamper-evident audit logs use Merkle seals over CAS-resident content; the seal can be verified without re-reading the original blobs.
- Snapshot manifests. bernstein dr snapshots reference CAS digests instead of duplicating large files into the snapshot dir.
bernstein.core.persistence.cas_store is the only module that touches the on-disk layout — every other consumer goes through put() / get() / has() / delete() so the store layout is free to evolve.
On-disk layout
.sdd/cas/ mirrors git's object database:
.sdd/cas/
├── 0a/
│ ├── 0a3f4c2e8b1d... 5c ← blob (raw bytes)
│ └── 0a3f4c2e8b1d... 5c.meta.json ← sidecar metadata
├── 1f/
│ ├── 1f99ab... ← blob
│ └── 1f99ab... .meta.json ← sidecar
└── ...
Blobs are sharded by the first two hex characters of the digest (_shard_dir at cas_store.py:114-116) so no single directory holds millions of files. With a uniform SHA-256 distribution, entries spread evenly over 256 possible shards (16 × 16): a store of one million blobs averages about 3,900 blobs, plus their sidecars, per shard.
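As a rough sketch of the sharding rule, the path for a blob can be derived from its digest alone. `shard_path` is a hypothetical helper written for illustration; the real logic lives in _shard_dir, _blob_path and _meta_path and may differ in detail.

```python
from pathlib import Path


def shard_path(root: Path, digest: str) -> Path:
    """Return <root>/<first two hex chars>/<digest> (hypothetical helper)."""
    return root / digest[:2] / digest


digest = "0a" + "f" * 62  # made-up 64-char digest, for illustration only
blob = shard_path(Path(".sdd/cas"), digest)
# The sidecar sits next to the blob with a .meta.json suffix appended.
meta = blob.with_name(blob.name + ".meta.json")
shard_count = 16 ** 2  # two hex characters give 256 possible shard names
```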
Each blob has a JSON sidecar (<digest>.meta.json) containing the CASEntry fields:
| Field | Type | Meaning |
|---|---|---|
| digest | string | SHA-256 hex (matches the filename). |
| size_bytes | int | Content length. |
| created_at | float | Unix timestamp of first insertion. |
| content_type | string | MIME-style tag (text/x-python, application/json, ...). |
| metadata | dict | Arbitrary user-supplied metadata. |
Source: CASEntry at cas_store.py:42-58.
has(digest) requires both blob and sidecar to exist (cas_store.py:215-216); a half-written entry is reported absent so recovery code can treat it as such.
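The blob-plus-sidecar rule can be sketched as follows. The sidecar keys mirror the CASEntry fields listed in the table; `write_entry` and `entry_present` are hypothetical helpers, and the write order (blob first, sidecar second) is an assumption rather than something this page confirms about cas_store.py.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def write_entry(root: Path, content: bytes, content_type: str) -> str:
    """Write blob and sidecar under <root>/<xx>/ (hypothetical helper)."""
    digest = hashlib.sha256(content).hexdigest()
    shard = root / digest[:2]
    shard.mkdir(parents=True, exist_ok=True)
    (shard / digest).write_bytes(content)
    sidecar = {
        "digest": digest,
        "size_bytes": len(content),
        "created_at": time.time(),
        "content_type": content_type,
        "metadata": {},
    }
    (shard / f"{digest}.meta.json").write_text(json.dumps(sidecar))
    return digest


def entry_present(root: Path, digest: str) -> bool:
    """An entry counts as present only if blob and sidecar both exist."""
    shard = root / digest[:2]
    return (shard / digest).is_file() and (shard / f"{digest}.meta.json").is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    digest = write_entry(root, b'{"ok": true}', "application/json")
    complete = entry_present(root, digest)
    # Simulate a half-written entry by removing the sidecar.
    (root / digest[:2] / f"{digest}.meta.json").unlink()
    half_written = entry_present(root, digest)
```

Here `complete` is true and `half_written` is false, which matches the behaviour described for has().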
API
CASStore is a thin class around the directory layout (cas_store.py:87-300):
from pathlib import Path
from bernstein.core.persistence.cas_store import CASStore, put_text
store = CASStore(Path(".sdd/cas"))
digest = store.put(b"hello world", content_type="text/plain")
assert store.get(digest) == b"hello world"
assert store.has(digest)
# Convenience wrappers for the common cases:
digest = put_text(store, "some text", metadata={"role": "qa"})
Public methods:
- put(content, content_type, metadata) → digest. No-op if the digest already exists (the _dedup_saves counter increments).
- get(digest) → bytes or None. Validates the digest format first to prevent path traversal (_validate_digest).
- has(digest) → bool.
- delete(digest) → bool. Removes blob + sidecar; cleans up empty shard directories.
- get_entry(digest) → CASEntry or None.
- list_entries() → list of all entries, sorted by created_at.
- stats() → CASStats(total_entries, total_bytes, dedup_saves) for dashboards.
Convenience helpers in the same module:
- put_file(store, path, metadata): read a file, guess content_type from the suffix, store with source_file in metadata.
- put_text(store, text, metadata): UTF-8 encode, store as text/plain.
Digest validation is regex-based (_HEX_RE = r"\A[0-9a-f]{64}\Z") so a malformed digest never reaches Path and a directory-traversal .. can't sneak through.
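The effect of that pattern can be checked in isolation. The regex below is the one quoted above; `is_valid_digest` is a hypothetical wrapper and is not claimed to match the signature or error behaviour of _validate_digest.

```python
import re

# Pattern as quoted in the text: exactly 64 lowercase hex characters.
HEX_RE = re.compile(r"\A[0-9a-f]{64}\Z")


def is_valid_digest(candidate: str) -> bool:
    """Hypothetical wrapper: True only for a well-formed SHA-256 hex digest."""
    return HEX_RE.match(candidate) is not None


good = "ab" * 32
checks = {
    "good": is_valid_digest(good),
    "uppercase": is_valid_digest(good.upper()),
    "traversal": is_valid_digest("../../etc/passwd"),
    "trailing_newline": is_valid_digest(good + "\n"),
    "too_short": is_valid_digest(good[:-1]),
}
```

Only the `good` case passes. Anchoring with `\A` and `\Z` rather than `^` and `$` matters here: in Python, `$` also matches just before a trailing newline, while `\Z` matches only at the very end of the string.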
Garbage collection
CAS entries are never automatically deleted by the store itself. Pruning is the responsibility of higher-level subsystems that know which digests are still referenced. Today there are three points at which entries are, or can be, removed:
- Project reset / bernstein cleanup. Removes the entire .sdd/cas/ tree along with the rest of .sdd/runtime/. Use this if you've confirmed nothing in the durable state still points at CAS digests.
- Disaster-recovery snapshot rotation. Old bernstein dr snapshots drop their reference to CAS digests; a follow-up sweep can delete any blob whose digest is no longer referenced by any snapshot manifest.
- Audit seal expiry. When a Merkle seal ages out, the CAS entries it referenced can be deleted via store.delete(digest).
Because delete() is explicit and per-digest, GC is essentially "find the orphans and call delete on each one". A reference scan over WAL + snapshots + audit seals produces the live set; everything in store.list_entries() not in the live set is orphaned.
There is no built-in mark-and-sweep job. If your workload churns through artifacts (e.g. heavy snapshot rotation), wire one up against your reference set and run it as a periodic CLI step.
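A sweep along those lines might look like the sketch below. `sweep_orphans` is a hypothetical function, and the store is faked with a plain dict so the example stays self-contained; how the live set is collected from WAL, snapshots and audit seals is left out because this page does not specify it.

```python
from typing import Callable, Iterable


def sweep_orphans(
    all_digests: Iterable[str],
    live_digests: set[str],
    delete: Callable[[str], bool],
) -> list[str]:
    """Delete every digest not in the live set; return the ones removed."""
    removed = []
    for digest in list(all_digests):  # snapshot first, since delete mutates
        if digest not in live_digests and delete(digest):
            removed.append(digest)
    return removed


# Stand-in for a store: a dict keyed by shortened, made-up digests.
fake_store = {"aaa": b"1", "bbb": b"2", "ccc": b"3"}
live = {"aaa", "ccc"}  # e.g. the union of WAL, snapshot and seal references
removed = sweep_orphans(
    fake_store.keys(),
    live,
    lambda d: fake_store.pop(d, None) is not None,
)
```

Against a real store, the digests would presumably come from the `digest` field of each entry returned by `store.list_entries()`, with `store.delete` passed as the callback. Running the sweep while writers are active risks deleting a blob that is about to be referenced, so a periodic offline step is the safer default.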
Integrity: Merkle hash tree
CAS pairs naturally with the Merkle integrity layer at core/persistence/merkle.py. The Merkle tree builds a binary hash tree over a deterministically-ordered list of (path, leaf_hash) pairs and publishes the root as a single SHA-256 string. For CAS entries:
- The leaf hash for a blob is its digest (which is its SHA-256 by construction).
- The internal nodes combine children with a domain-separated hash: sha256("merkle:" + left + ":" + right) (merkle.py:_combine_hashes at merkle.py:56-58).
- The root signs the entire CAS state at a point in time. A single root hash proves the contents of every leaf without re-reading them.
Seals are JSON files at .sdd/audit/merkle/seal-<ISO-timestamp>.json and serve compliance evidence: a verifier rebuilds the tree from the on-disk leaves, compares roots, and reports tamper.
If a CAS blob is corrupted (digest mismatch on read) or deleted underneath the Merkle layer, the next seal verification will surface the discrepancy.
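To make the tree construction concrete, here is a minimal sketch of a root computation using the combine formula quoted above. Several details are assumptions that this page does not confirm about merkle.py: children are combined as hex strings, leaves are ordered by sorting their hashes (the real tree orders (path, leaf_hash) pairs), an unpaired node is promoted unchanged to the next level, and an empty tree hashes the empty byte string.

```python
import hashlib


def combine(left: str, right: str) -> str:
    """Domain-separated parent hash: sha256("merkle:" + left + ":" + right)."""
    return hashlib.sha256(f"merkle:{left}:{right}".encode()).hexdigest()


def merkle_root(leaves: list[str]) -> str:
    """Root over leaf hashes (hypothetical; see assumptions in the lead-in)."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()  # assumed empty-tree value
    level = sorted(leaves)  # deterministic order regardless of input order
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # assumed: odd node promoted unchanged
        level = nxt
    return level[0]


digests = [hashlib.sha256(p).hexdigest() for p in (b"a", b"b", b"c")]
root = merkle_root(digests)
# Replacing one blob changes its leaf hash and therefore the root.
tampered = merkle_root([digests[0], digests[1], hashlib.sha256(b"C").hexdigest()])
```

The `merkle:` prefix keeps internal-node hashes in a different domain from raw blob digests, so a leaf cannot be passed off as an internal node.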
Cross-links
- See state-persistence.md for the full .sdd/ layout and where CAS sits relative to WAL, audit logs, and the backlog. The state-persistence doc lists CAS as one of the "durable" surfaces, meaning it survives a restart and you should not gitignore it if you depend on artifact dedup across runs.
- See warm-pool.md for the orthogonal optimisation on the spawn path (the warm pool dedups processes; CAS dedups bytes).
Code pointers
| Concern | File |
|---|---|
| Store implementation (put/get/has/delete) | src/bernstein/core/persistence/cas_store.py |
| CASEntry / CASStats data classes | cas_store.py:42-79 |
| put_file / put_text helpers | cas_store.py:308-376 |
| Digest validation (path-traversal guard) | _HEX_RE, _validate_digest at cas_store.py:101-112 |
| Shard layout | _shard_dir, _blob_path, _meta_path at cas_store.py:114-124 |
| Merkle leaf hashing | src/bernstein/core/persistence/merkle.py:66-86 |
| Merkle tree builder | merkle.py:94- |
| State-persistence overview | docs/architecture/state-persistence.md |