At QCon San Francisco 2025, DoorDash engineers presented WAIL (Write-Ahead Intent Log), a custom change data capture (CDC) architecture they built after Debezium hit operational limits under peak order-traffic load. The pattern now underlies change capture across hundreds of DoorDash services.
The root problem is dual-write brittleness. When an order lands in the order database, a second write must fire an event to the streaming system. The two writes are not atomic: if the database write succeeds but the event write fails, or vice versa, downstream systems diverge. The consequence is concrete. The restaurant never receives the order notification, the driver is never dispatched, and the customer watches a spinner until the failure escalates into a refund.
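A minimal sketch of the dual-write hazard, using hypothetical in-memory stand-ins (`FakeDatabase`, `FakeBroker`) rather than any real database or Kafka client. The database write succeeds, the publish fails, and the two stores disagree:

```python
from dataclasses import dataclass, field


class BrokerUnavailable(Exception):
    """Raised by the fake broker to simulate a failed publish."""


@dataclass
class FakeDatabase:
    rows: dict = field(default_factory=dict)

    def insert(self, key: str, value: dict) -> None:
        self.rows[key] = value


@dataclass
class FakeBroker:
    healthy: bool = True
    events: list = field(default_factory=list)

    def publish(self, topic: str, event: dict) -> None:
        if not self.healthy:
            raise BrokerUnavailable(topic)
        self.events.append((topic, event))


def place_order_dual_write(db: FakeDatabase, broker: FakeBroker,
                           order_id: str, order: dict) -> None:
    # Write 1: commit the order row.
    db.insert(order_id, order)
    # Write 2: publish the event. Nothing ties this to write 1, so if it
    # fails, the row exists but downstream consumers never hear about it.
    broker.publish("orders", {"order_id": order_id, **order})


db, broker = FakeDatabase(), FakeBroker(healthy=False)
try:
    place_order_dual_write(db, broker, "o-1", {"restaurant": "r-9"})
except BrokerUnavailable:
    pass  # the caller sees an error, but the row is already committed

print("o-1" in db.rows, len(broker.events))  # True 0
```

Retrying the publish does not fix this in general: the process can crash between the two writes, and the retry never happens.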
Standard CDC was supposed to solve this. Debezium reads the database transaction log and emits change events to Kafka, removing the dual-write race. In theory. In heterogeneous environments, with multiple database engines each exposing its own transaction-log format (PostgreSQL's WAL, MySQL's binlog, and so on), Debezium connectors become per-engine state machines that each need individual tuning. Under DoorDash's peak load, Debezium hit its limits. The failure mode is consistent with known Debezium behavior at scale: growing replication lag measured in log sequence numbers (LSN lag), connector crashes on schema changes, and backpressure propagating upstream when a sink goes down.
WAIL restructures the handoff into two components: a dumb producer proxy that records only the intent of a change, and a smart consumer that resolves state. The producer does minimal work: it logs the intent and returns, without embedding a full before/after payload. The consumer fetches the current state when it processes the intent log entry. This separation keeps the producer path fast and failure-isolated; a consumer crash does not block writers or corrupt the log.
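A minimal sketch of that split, assuming an in-memory list as the log and a dict as the database. `Intent`, `IntentLog`, `record_intent`, and `resolve` are hypothetical names chosen for illustration, not DoorDash's API; the talk summary does not describe WAIL's actual record format.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Intent:
    """Records *which* entity changed, not *what* it changed to."""
    seq: int
    table: str
    key: str
    recorded_at: float


@dataclass
class IntentLog:
    """Append-only, in-memory stand-in for a durable intent log (hypothetical)."""
    entries: list = field(default_factory=list)

    def append(self, table: str, key: str) -> Intent:
        intent = Intent(len(self.entries), table, key, time.time())
        self.entries.append(intent)
        return intent


def record_intent(log: IntentLog, table: str, key: str) -> None:
    # "Dumb" producer: no before/after payload, no replication or broker state.
    log.append(table, key)


def resolve(intent: Intent,
            fetch_row: Callable[[str, str], Optional[dict]]) -> dict:
    # "Smart" consumer: reads the row's current state at processing time.
    row = fetch_row(intent.table, intent.key)
    return {"table": intent.table, "key": intent.key, "state": row}


db = {("orders", "o-1"): {"status": "placed"}}
log = IntentLog()
record_intent(log, "orders", "o-1")
db[("orders", "o-1")] = {"status": "confirmed"}  # row changes before the consumer runs

event = resolve(log.entries[0], lambda t, k: db.get((t, k)))
print(event["state"])  # {'status': 'confirmed'}
```

Because the consumer resolves state when it processes the entry, it reports `confirmed` even though the intent was logged while the order was still `placed`.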
The smart consumer pattern shifts responsibility for state consistency from the write path to the read path. A Debezium connector that holds replication slots, tracks LSNs, and manages Kafka offsets is a stateful component tightly coupled to both database and broker. When anything fails (a sink goes down, a replication slot falls behind, a broker partition rebalances), the failure radius can reach the write path. On PostgreSQL, for example, a replication slot that stops advancing forces the primary to retain WAL, and that disk pressure lands on the same database serving writes. WAIL's dumb producer has no replication slot, no offset state, and no broker coupling. It appends to a log. The consumer reconciles.
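A minimal sketch of that failure isolation, assuming an in-memory log and a consumer-owned checkpoint; a real consumer would persist `next_offset` durably. `IntentLog`, `Checkpoint`, and `consume` are hypothetical names. The consumer crashes mid-stream, the producer keeps appending, and a restarted consumer resumes from its checkpoint:

```python
from dataclasses import dataclass, field


@dataclass
class IntentLog:
    entries: list = field(default_factory=list)

    def append(self, key: str) -> None:
        # Producer path: a plain append. It never reads consumer state.
        self.entries.append(key)


@dataclass
class Checkpoint:
    # Consumer-owned cursor; the producer never touches it.
    next_offset: int = 0


def consume(log: IntentLog, cp: Checkpoint, handle) -> None:
    while cp.next_offset < len(log.entries):
        handle(log.entries[cp.next_offset])
        # Advance only after the handler succeeds (at-least-once delivery).
        cp.next_offset += 1


log, cp, seen = IntentLog(), Checkpoint(), []
failures = {"o-2"}  # keys whose first attempt fails, to simulate a crash


def handle(key: str) -> None:
    if key in failures:
        failures.discard(key)
        raise RuntimeError(f"consumer crashed on {key}")
    seen.append(key)


for k in ("o-1", "o-2", "o-3"):
    log.append(k)

try:
    consume(log, cp, handle)
except RuntimeError:
    pass  # consumer is down; cp.next_offset still points at "o-2"

log.append("o-4")  # writers are unaffected by the consumer crash

consume(log, cp, handle)  # restarted consumer resumes from the checkpoint
print(seen, cp.next_offset)  # ['o-1', 'o-2', 'o-3', 'o-4'] 4
```

The only shared object is the log itself. Advancing the cursor after the handler succeeds means a crash causes a redelivery rather than a gap, which is why the consumer also needs the idempotency logic discussed below.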
At DoorDash's scale, the practical wins are visibility and recoverability. A domain-oriented intent log is auditable: you can replay from any point, route to multiple consumers, and instrument intent-level latency independently of state-resolution latency. The presentation frames the goal as moving from a brittle architecture to one that is "durable, visible, and recoverable." Specific SLA improvements have not been published.
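A minimal sketch of those three properties, assuming an in-memory log whose entries carry an append timestamp and consumers that each own a cursor. `Entry`, `Consumer`, `poll`, and `replay_from` are hypothetical names. Two consumers read the same log independently, one rewinds to replay, and each records pickup lag (append to pickup) separately from resolution time (time spent fetching state):

```python
import time
from dataclasses import dataclass, field


@dataclass
class Entry:
    key: str
    appended_at: float  # time.monotonic() at append


@dataclass
class Consumer:
    name: str
    offset: int = 0
    pickup_lag: list = field(default_factory=list)    # append -> pickup
    resolve_time: list = field(default_factory=list)  # time spent fetching state
    applied: list = field(default_factory=list)

    def poll(self, log: list, fetch) -> None:
        while self.offset < len(log):
            entry = log[self.offset]
            picked_up = time.monotonic()
            self.pickup_lag.append(picked_up - entry.appended_at)
            state = fetch(entry.key)
            self.resolve_time.append(time.monotonic() - picked_up)
            self.applied.append((entry.key, state))
            self.offset += 1

    def replay_from(self, offset: int) -> None:
        # Rewinding only moves this consumer's cursor; the log is unchanged.
        self.offset = offset


log = [Entry(k, time.monotonic()) for k in ("o-1", "o-2")]
db = {"o-1": "placed", "o-2": "confirmed"}

search, notify = Consumer("search-index"), Consumer("notifications")
search.poll(log, db.get)
notify.poll(log, db.get)
search.replay_from(0)  # e.g. rebuild a derived index
search.poll(log, db.get)
print(len(search.applied), len(notify.applied))  # 4 2
```

Both timestamps come from `time.monotonic()` in one process; across machines, pickup lag would need a shared clock and would inherit its skew.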
Consumer-side complexity does increase. When the smart consumer fetches current state, it must handle cases where the row has been updated again since the intent was logged, so the intent and the current state no longer match. This requires careful idempotency logic and version tracking. Teams operating Debezium at scale already deal with this via snapshot-plus-stream alignment; WAIL moves the problem rather than erasing it.
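A minimal sketch of version-tracked, idempotent resolution, assuming each row carries a monotonically increasing `version` column (an assumption; the talk summary does not say how WAIL versions rows). `Row` and `VersionedConsumer` are hypothetical names. The consumer emits a row only when its version is newer than anything already emitted for that key:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Row:
    key: str
    version: int  # assumption: bumped on every update to the row
    status: str


@dataclass
class VersionedConsumer:
    last_applied: dict = field(default_factory=dict)  # key -> highest version emitted
    emitted: list = field(default_factory=list)

    def process(self, intent_key: str, fetch) -> None:
        row: Optional[Row] = fetch(intent_key)
        if row is None:
            return  # deletes/tombstones are out of scope for this sketch
        if row.version <= self.last_applied.get(row.key, 0):
            return  # this state (or a newer one) was already emitted
        self.emitted.append((row.key, row.version, row.status))
        self.last_applied[row.key] = row.version


# Three intents were logged for versions 1-3, but the row is already at v3.
db = {"o-1": Row("o-1", 3, "picked_up")}
consumer = VersionedConsumer()
for _ in range(3):
    consumer.process("o-1", db.get)
print(consumer.emitted)  # [('o-1', 3, 'picked_up')]

db["o-1"] = Row("o-1", 4, "delivered")
consumer.process("o-1", db.get)
consumer.process("o-1", db.get)  # redelivered intent: no duplicate event
print(consumer.emitted)  # [('o-1', 3, 'picked_up'), ('o-1', 4, 'delivered')]
```

One consequence of this design: intermediate states (versions 1 and 2 here) are never emitted. That suits consumers that only need the latest state; consumers that need every transition would need something else, such as a before/after payload captured at write time, which the dumb producer deliberately omits.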
For teams whose CDC stack spans heterogeneous databases and who are burning connector-ops cycles at peak load, separating intent logging from state resolution is a structurally sound refactor. The smart consumer is where the complexity lands, not where it disappears.
Written and edited by AI agents.