A production post-mortem published in May 2026 on InfoQ documents how a search-and-ads retrieval pipeline migrated from scheduled batch jobs to Spark Structured Streaming in micro-batch mode. Author Parveen Saini's account shows where the architecture cracked under production load: scheduling delay and orchestration overhead, not computation, turned out to be the primary bottleneck. The standard toolkit for solving that problem carried its own failure modes.
The pipeline maintained a Solr-backed inverted index covering several million documents. Full rebuilds took two to three hours, with validation and deployment pushing total turnaround to roughly five hours, making frequent full rebuilds impractical. The delta pipeline ingested new ads, campaign updates, and behavioral signals like co-purchase data. It ran on time-partitioned files in S3-style object storage, receiving new incremental data every five to seven minutes. Each delta run covered the last five hours of partitions, with multiple runs per hour expected. A staleness gap of one scheduling cycle translated directly into delayed ad activation and stale retrieval results.
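The article does not publish the pipeline's path scheme, so the following is a minimal sketch of what one delta run's input window could look like, assuming a hypothetical hive-style layout (`s3://ads-index/deltas/dt=YYYY-MM-DD/hour=HH/`) and hourly partitions; the prefix, layout, and window constant are illustrative assumptions, not the team's actual configuration.

```python
# Sketch of the trailing five-hour input window for one delta run.
# BASE_PREFIX and the dt=/hour= layout are hypothetical.
from datetime import datetime, timedelta, timezone

BASE_PREFIX = "s3://ads-index/deltas"  # hypothetical bucket/prefix
WINDOW_HOURS = 5                       # each run re-reads the last five hours of partitions


def partition_prefixes(now: datetime, window_hours: int = WINDOW_HOURS) -> list[str]:
    """Return the hourly partition prefixes covered by a single delta run."""
    cursor = (now - timedelta(hours=window_hours)).replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while cursor <= now:
        prefixes.append(f"{BASE_PREFIX}/dt={cursor:%Y-%m-%d}/hour={cursor:%H}/")
        cursor += timedelta(hours=1)
    return prefixes


if __name__ == "__main__":
    for prefix in partition_prefixes(datetime.now(timezone.utc)):
        print(prefix)
```

With hourly partitions and runs expected several times per hour, most partitions in the window are re-read on every cycle, which is why a one-cycle scheduling delay maps directly onto retrieval staleness.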
The team converted the batch jobs to continuously running micro-batches using Spark Structured Streaming. They did not rely on Spark's native checkpointing or event-time watermark semantics. The pipeline advanced on partition-level progress rather than ordered event streams. The team maintained an external logical watermark tracking the latest processed partition by timestamp. Progress was determined by listing and interpreting partitioned data in object storage, not by consuming an ordered log.
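The post does not include code, but the external-watermark pattern it describes can be sketched as a small cursor object stored in the same object store as the data, read before and committed after every micro-batch. The bucket, key names, and the placeholder indexing step below are assumptions for illustration only.

```python
# Sketch of an external logical watermark for a micro-batch loop that does not
# use Spark's native checkpointing. All names (bucket, keys) are hypothetical.
import json

import boto3
from pyspark.sql import SparkSession

BUCKET = "ads-index"                    # hypothetical
WATERMARK_KEY = "state/watermark.json"  # hypothetical external cursor

s3 = boto3.client("s3")
spark = SparkSession.builder.appName("delta-indexer").getOrCreate()


def read_watermark() -> str:
    """Return the timestamp of the latest fully processed partition."""
    obj = s3.get_object(Bucket=BUCKET, Key=WATERMARK_KEY)
    return json.loads(obj["Body"].read())["last_partition"]


def commit_watermark(partition_ts: str) -> None:
    """Advance the cursor only after the downstream index write has succeeded."""
    body = json.dumps({"last_partition": partition_ts}).encode()
    s3.put_object(Bucket=BUCKET, Key=WATERMARK_KEY, Body=body)


def process_partition(path: str) -> None:
    """One micro-batch: read a time partition and push the delta to the index."""
    df = spark.read.parquet(path)
    # The real pipeline would transform and write to Solr here; printing keeps
    # the sketch inert.
    print(f"would index {df.count()} rows from {path}")
```

The point of keeping the cursor outside Spark is that progress is defined by partition timestamps the team controls, not by offsets or checkpoint state owned by the framework.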
Two failure categories dominated the post-mortem. First, S3 eventual consistency made completion markers and success-file patterns unreliable as signals that a partition was ready to process. The team adopted deterministic, rate-based progress — advancing by time rather than waiting for an explicit "done" signal. This approach held under production variance. Second, lag and restart semantics required explicit design rather than inheritance from the framework. In a freshness-driven pipeline with overlapping window semantics, replaying the full backlog after a restart degraded freshness further. The fix: skip directly to the latest available partition on restart, treating missed intermediate states as acceptable loss for immediate freshness recovery.
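Both remedies reduce to simple cursor arithmetic. The sketch below illustrates them under assumed hourly partitions and an assumed ten-minute safety lag; the function names, step size, and lag value are not from the article.

```python
# Sketch of the two fixes: rate-based progress (advance on the clock, not on
# _SUCCESS markers) and skip-to-latest restart semantics. Values are illustrative.
from datetime import datetime, timedelta, timezone

PARTITION_STEP = timedelta(hours=1)   # hypothetical partition granularity
SAFETY_LAG = timedelta(minutes=10)    # let late writers land before reading


def advance_by_time(last_processed: datetime, now: datetime) -> datetime | None:
    """Deterministic, rate-based progress: the next partition becomes eligible
    once the clock has moved one step (plus a safety lag) past it, with no
    dependence on completion markers."""
    candidate = last_processed + PARTITION_STEP
    if candidate + SAFETY_LAG <= now:
        return candidate
    return None  # nothing eligible yet; poll again on the next cycle


def resume_after_restart(now: datetime) -> datetime:
    """On restart, jump the cursor to the newest eligible partition rather than
    replaying the backlog; intermediate partitions are accepted as lost in
    exchange for immediate freshness recovery."""
    return (now - SAFETY_LAG).replace(minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("cursor after restart:", resume_after_restart(now))
    print("next in steady state:", advance_by_time(now - timedelta(hours=3), now))
```

The design choice worth noting is that both paths are pure functions of timestamps, so the same logic behaves identically in steady state, after a crash, and during a backfill test.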
For enterprise data architects, the structural implication is direct. Teams migrating batch pipelines to streaming often default to Kafka or a managed streaming service. This post-mortem argues that for object-store-based pipelines, which cover a large share of enterprise data infrastructure, that migration introduces per-record operational complexity without delivering meaningful latency improvement. The scheduling delay, not the processing model, drives freshness lag. Micro-batch over existing object storage with explicit external watermark management closes that gap while keeping the operational surface area close to the batch infrastructure teams already know.
The second implication targets ML inference pipelines. Teams piping feature data or retrieval indexes to LLM inference endpoints face the same freshness-vs-complexity tradeoff. The report's finding that long-running streaming jobs should treat restarts as normal operations — not failure conditions — applies directly to any continuously running feature engineering or embedding refresh job feeding a model serving layer.
Open questions remain: whether the external watermark approach scales to multi-tenant environments where partition ownership is shared, and how the architecture interacts with Delta Lake's transaction log when Delta Lake, rather than raw S3, is the storage layer. The case study is scoped to a fixed-function pipeline, and Saini explicitly notes the design choices are constraint-specific.
The takeaway for platform teams: if streaming migration is stalled on Kafka adoption, it may be because Kafka is not required. Replacing a scheduler with a continuously running micro-batch job and an external cursor is often the right answer — provided you design the failure path before the first production incident forces you to.