Simon Willison shipped sqlite-utils 4.0rc2 on July 5 after a $149.25 audit by Claude Fable found five release blockers that his own testing had missed. The worst was a silent data-loss bug: calling `Table.delete_where()` outside an `atomic()` block left the connection in an open transaction. Every subsequent `atomic()` call then detected that open transaction, took the savepoint branch, and never committed. The database looked intact until `db.close()`, at which point every operation since the delete was gone.
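The failure mode can be reproduced in miniature with only the standard-library `sqlite3` module. The sketch below is not sqlite-utils code: `atomic()` is a hypothetical stand-in that assumes the behavior described above (a savepoint when a transaction is already open, `BEGIN`/`COMMIT` otherwise), and the bare `DELETE` plays the role of `Table.delete_where()`.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic(conn):
    """Hypothetical stand-in for an atomic() helper, not the library's code."""
    if conn.in_transaction:
        # A transaction is already open: nest with a savepoint.
        conn.execute("SAVEPOINT sp")
        try:
            yield
        except Exception:
            conn.execute("ROLLBACK TO SAVEPOINT sp")
            conn.execute("RELEASE SAVEPOINT sp")
            raise
        else:
            # Releasing a savepoint does NOT commit the outer transaction.
            conn.execute("RELEASE SAVEPOINT sp")
    else:
        conn.execute("BEGIN")
        try:
            yield
        except Exception:
            conn.rollback()
            raise
        else:
            conn.commit()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()

    # Stand-in for delete_where(): the DELETE implicitly opens a
    # transaction (default sqlite3 mode) and nothing commits it.
    conn.execute("DELETE FROM t WHERE id = 1")

    # This block sees the open transaction and takes the savepoint branch.
    with atomic(conn):
        conn.execute("INSERT INTO t VALUES (2)")

    seen_before_close = conn.execute("SELECT id FROM t").fetchall()
    conn.close()  # the still-open transaction is discarded

    reopened = sqlite3.connect(path)
    seen_after_reopen = reopened.execute("SELECT id FROM t").fetchall()
    reopened.close()
    return seen_before_close, seen_after_reopen
```

Calling `demo()` returns what the connection saw before closing and what a fresh connection sees afterwards. Under these assumptions the two differ, which is the "looked intact until close" symptom.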
Willison had passed a single pre-release prompt to Fable via Claude Code for web from his iPhone: "Final review before shipping a stable 4.0 release - very important to spot any last minute things that would be a breaking change if we fix them later." Fable returned a structured report by severity. Willison confirmed the end-to-end reproduction. "That's a really bad bug," he wrote. "Very glad I didn't ship that, although at least it would have been a bug I could fix in a 4.0.1 point release, not a design flaw that would force a 5.0."
Resolving all five blockers took 37 prompts and 34 commits, with 1,321 lines added and 190 removed across 30 files. The dominant work was a new auto-commit transaction model. Every write method (`insert()`, `upsert()`, `update()`, `delete()`, `delete_where()`, `transform()`, `create_table()`, `create_index()`, `enable_fts()`, and raw `db.execute()` write statements) now commits on return, so callers never need an explicit `commit()`. Two exceptions remain: `db.atomic()` for grouping multiple writes into one transaction, and `db.begin()` for manually managed transactions that the library never auto-commits.
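To make the three modes concrete, here is a minimal sketch of such a model built on the standard-library `sqlite3` module. The `AutoCommitDB` class and its `write()` method are hypothetical names invented for illustration. Only the ideas of commit-on-return, `atomic()`, and `begin()` come from the description above, and the real sqlite-utils implementation may differ substantially.

```python
import sqlite3
from contextlib import contextmanager


class AutoCommitDB:
    """Hypothetical wrapper illustrating the described model."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self._manual = False  # True while the caller owns the transaction
        self._depth = 0       # nesting depth of atomic() blocks

    def write(self, sql, params=()):
        cursor = self.conn.execute(sql, params)
        # Commit on return unless a grouped or manual transaction is active.
        if not self._manual and self._depth == 0:
            self.conn.commit()
        return cursor

    @contextmanager
    def atomic(self):
        self._depth += 1
        outermost = self._depth == 1 and not self._manual
        try:
            yield self
        except Exception:
            if outermost:
                self.conn.rollback()
            raise
        else:
            if outermost:
                self.conn.commit()
        finally:
            self._depth -= 1

    def begin(self):
        # Manual mode: nothing is committed until the caller says so.
        self.conn.execute("BEGIN")
        self._manual = True

    def commit(self):
        self.conn.commit()
        self._manual = False
```

In this sketch a plain `write()` leaves no transaction open, writes inside `with db.atomic():` are committed together when the outermost block exits, and writes after `db.begin()` stay pending until `db.commit()`.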
A Python 3.12 compatibility issue emerged mid-review. Fable's documentation notes flagged that connections opened with `sqlite3.connect(..., autocommit=True)` or `autocommit=False`, a parameter added in Python 3.12, give `commit()` and `rollback()` different semantics from the default legacy mode. Willison had not considered this interaction. Testing revealed that the entire test suite failed on those connections, so a compatibility fix was required before rc2 could ship.
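The difference is easy to observe directly. The following probe is a sketch using only the standard library; `probe_autocommit_modes()` is a name chosen here for illustration. It records `Connection.in_transaction` right after connecting and again after a write followed by `commit()`, and it returns `None` on interpreters older than 3.12, where the `autocommit` parameter does not exist.

```python
import sqlite3
import sys


def probe_autocommit_modes():
    """Return {mode: (in_transaction_after_connect, in_transaction_after_commit)}."""
    if sys.version_info < (3, 12):
        return None  # the autocommit parameter was added in Python 3.12

    results = {}
    for mode in (True, False):
        conn = sqlite3.connect(":memory:", autocommit=mode)
        after_connect = conn.in_transaction
        conn.execute("CREATE TABLE t (x)")
        conn.execute("INSERT INTO t VALUES (1)")
        # autocommit=True: commit() has no effect, each statement has
        # already been committed by SQLite itself.
        # autocommit=False: commit() commits, then a new transaction
        # is opened immediately.
        conn.commit()
        results[mode] = (after_connect, conn.in_transaction)
        conn.close()
    return results
```

Code that decides between "begin a transaction" and "use a savepoint" by checking `in_transaction` would behave differently across these modes, which is one plausible way a suite written against the legacy default could fail.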
Willison then sent the diff to GPT-5.5 via Codex Desktop for a cross-model review with minimal instructions: "Review changes since the last RC. Also confirm that the changelog is up-to-date." GPT-5.5 found two additional P1 issues. First: `db.query()` at line 663 validates that a statement returns rows, but only after `db.execute()` has already auto-committed the write — so `db.query("UPDATE ...")` raises `ValueError` on a change that has already been persisted. Second: `INSERT ... RETURNING` through `db.query()` at line 672 only commits after the returned generator is fully exhausted; partial iteration via `next()` leaves the transaction open.
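Both issues follow from where the commit sits relative to the rest of the function. The sketch below reproduces the two patterns with the standard-library `sqlite3` module. `query_validate_late()` and `query_rows()` are hypothetical functions written for illustration and are not the code at the cited lines. The second pattern needs SQLite 3.35 or newer for `RETURNING`.

```python
import sqlite3


def query_validate_late(conn, sql):
    """Pattern 1 (hypothetical): the write is committed before validation."""
    cursor = conn.execute(sql)
    conn.commit()  # the change is now persisted
    if cursor.description is None:  # statement produced no result columns
        raise ValueError("statement does not return rows")
    return cursor.fetchall()


def query_rows(conn, sql):
    """Pattern 2 (hypothetical): commit only runs once the generator is exhausted."""
    cursor = conn.execute(sql)
    for row in cursor:
        yield row
    conn.commit()  # never reached if the caller stops iterating early


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()

    # Pattern 1: the error is raised, yet the UPDATE has been committed.
    try:
        query_validate_late(conn, "UPDATE t SET x = 2")
        raised = False
    except ValueError:
        raised = True
    conn.rollback()  # nothing left to undo
    persisted = conn.execute("SELECT x FROM t").fetchall()

    # Pattern 2: partial iteration leaves the transaction open.
    open_after_next = None
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        rows = query_rows(conn, "INSERT INTO t (x) VALUES (10), (11) RETURNING x")
        next(rows)
        open_after_next = conn.in_transaction
        rows.close()
        conn.rollback()
    conn.close()
    return raised, persisted, open_after_next
```

A common remedy for the first pattern is to validate before executing or to wrap the call in a transaction that is rolled back on error. For the second, materializing the rows before committing avoids tying the commit to how far the caller iterates. Whether sqlite-utils took either approach is not stated in the source.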
Willison on the cross-model workflow: "I used to think that the idea of having one model review the work of another was somewhat absurd. The problem is it really does work—I've started habitually having Anthropic's best model review OpenAI's work and vice versa, because I've had that turn up interesting results often enough to be valuable."
Individual Fable tasks took 10-15 minutes to complete. Willison attended the Half Moon Bay 4th of July parade between sessions, prompting next steps from his phone. Frontier agents produce denser task outputs but impose real latency that autocomplete-style tools don't.
For $149.25, Fable caught a silent data-loss bug before it shipped in a stable major release. The receipts are public: a full PR, a shared transcript, named line numbers. The cross-model step found two more P1 bugs the primary agent missed. That is the actual production benchmark for AI-assisted release engineering.
Written and edited by AI agents